Last week, I went over several papers in the area of Biased Inverse Reinforcement Learning and Fairness Criteria. I also brainstormed several possible next steps for research. After talking with CJ on Monday, we decided that the second option was the most promising – extending the ideas behind this paper by Shah et al. to the more human domains of chess and poker. As a reminder, this paper's goal was to automatically learn how agents act under a biased policy (a systematic pattern of deviations from optimality) in a gridworld domain, without any assumptions about the specific nature of the biases.
During the week, I did a scan of the literature to see what prior research had been done on biased players in chess and poker. I found a few papers on how computers can be programmed to be biased, but they weren't especially relevant. I found one paper, Playing Chess at a Human Desired Level and Style, which was interesting, but it used the word "style" only in the context of playing human-like vs. computer-like, while we need a much broader sense of the word. Overall, I didn't find anything super useful on chess, so I'm on my own here (but that's research!).
I also read several papers on poker, and learned quite a bit about the metagame of top-level poker. However, poker seems like a much harder domain than chess in which to explore suboptimal moves, due to my lack of domain knowledge and the fact that a suboptimal-looking move can actually be the best move to play in some scenarios (e.g. bluffing). For this reason, I'm probably going to focus on chess for my research this rotation.
Chess does have its own issues as a domain – it has a huge state space, estimated at around 10^46 positions, meaning that even with a very large database like ChessBase or Lichess, only a tiny fraction of positions past the opening will be shared between games, making it difficult to compare different agents against each other. The state representation itself is also very large: the board has 64 squares and up to 32 pieces to place on it, along with other factors like en passant options, castling rights, etc.
Additionally, the notion of reward is much harder to articulate in the chess domain than in the gridworld domain. Generally, we would like to say that a human wants to maximize the evaluation or expected score of their position. We can use a computer to measure the evaluation of the position, but we would ideally like to relate the moves themselves to the evaluation in order to learn the human biases. For example, in a certain position, White might have two moves leading to a +1.5 evaluation: a piece sacrifice that starts an attack, and a normal capture that leaves the enemy with doubled pawns. We want to differentiate between the two moves to show that one player would be aggressive and the other positional, but just using a FEN string and a single numerical evaluation isn't helpful for this.
I met again with CJ on Friday to update him on my progress and discuss these issues. He suggested that the key to solving them would be a good representation of the state space. I came up with an idea to compress an entire board state into just a few numbers, which would drastically reduce the state space complexity and also provide more useful information describing the priorities of an agent. A big plus of this approach is that it can be transferred to any other domain – we aren't learning how an agent is biased towards certain chess positions and principles, we are learning how the agent is biased towards tuples of numbers, something we can do regardless of domain. The downside is that this representation (for chess) becomes extremely non-deterministic, as knowing what move is played gives essentially no information about what the next state will be, even for an optimal agent. CJ said this is fine though, and we can just treat future state transitions as deterministic.
With this in mind, the next step was to actually figure out the compression method – how do we take a board state and turn it into a list of numbers? CJ reminded me during the meeting that the specific compression method doesn’t really matter for the purposes of the rotation as long as it works, but I still wanted to find something that I thought was actually meaningful. My first idea was to use my own domain knowledge to create hand-crafted statistics about the position, calculating things like material imbalances, number of pieces, space, etc.
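To make the hand-crafted route concrete, here is a minimal sketch of one such statistic – material imbalance computed directly from the piece-placement field of a FEN string. The function name and the simple 1/3/3/5/9 piece values are my own illustration, not anything taken from Stockfish:

```python
# Classic pawn-unit piece values; the king carries no material value.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def material_imbalance(fen: str) -> int:
    """Return White's material minus Black's, in pawns.

    In FEN, uppercase letters are White's pieces and lowercase are Black's;
    digits and slashes in the placement field encode empty squares and ranks.
    """
    placement = fen.split()[0]  # first FEN field: piece placement
    score = 0
    for ch in placement:
        if ch.isalpha():
            value = PIECE_VALUES[ch.lower()]
            score += value if ch.isupper() else -value
    return score

# The starting position is materially balanced:
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(material_imbalance(start))  # 0
```

Piece counts, space, and similar statistics could be computed from the same placement field in the same style.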
As I was thinking about this, I realized that engines like the open-source Stockfish actually do this component breakdown as part of their overall evaluation method, and I could just borrow the evaluation terms for my own use. Since Stockfish doesn't expose this breakdown in normal use, I couldn't just use standard Python wrappers like python-chess or stockfish. After some reading (and help from the Stockfish Discord), I was able to get the breakdown by using Python's subprocess.Popen to open a Stockfish executable as a child process and pass in the relevant input. Link to code repo.
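The process communication might look something like the sketch below. Stockfish's command-line `eval` command prints the per-term evaluation table for the current position; the engine path and the exact output parsing are assumptions here, and the subprocess call is only attempted if a `stockfish` binary is actually on the PATH:

```python
import shutil
from subprocess import Popen, PIPE

def build_eval_input(fen: str) -> str:
    """Commands asking Stockfish for its static evaluation breakdown.

    'position fen ...' sets up the board, 'eval' prints the per-term
    table, and 'quit' makes the process exit so communicate() returns.
    """
    return f"position fen {fen}\neval\nquit\n"

def stockfish_eval(fen: str, engine_path: str = "stockfish") -> str:
    """Run the engine as a child process and return its raw eval output."""
    proc = Popen([engine_path], stdin=PIPE, stdout=PIPE, text=True)
    out, _ = proc.communicate(build_eval_input(fen))
    return out

fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
if shutil.which("stockfish"):  # only run if a binary is available
    print(stockfish_eval(fen))
```

The raw output would then still need to be parsed into the component tuple described below.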
Currently, I have three main functions – evaluate(), get_best_move(), and get_next_states(). evaluate() takes in a FEN position and an optional next-move parameter, and returns a tuple for the position showing the component breakdown given by Stockfish. In order to reduce the state space, I'm only saving four of the values, and not using anything from the endgame values. I'm also rounding each value to the nearest tenth (instead of hundredth) and constraining all values to [-2.0, 2.0]. get_best_move() takes in a FEN position and returns the optimal action given by Stockfish. get_next_states() takes in a FEN and returns a set of all next (compressed) states.
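The rounding-and-clamping step can be sketched in a few lines. The four component names below are illustrative placeholders, not the exact terms I'm saving from the breakdown:

```python
def compress(value: float) -> float:
    """Round an evaluation term to the nearest tenth and clamp it to [-2.0, 2.0]."""
    return max(-2.0, min(2.0, round(value, 1)))

def compress_state(terms: dict) -> tuple:
    """Reduce a full component breakdown to a small tuple of compressed values.

    The four keys here are hypothetical stand-ins for the terms actually kept.
    """
    keys = ("material", "mobility", "king safety", "threats")
    return tuple(compress(terms[k]) for k in keys)

terms = {"material": 1.23, "mobility": -0.47, "king safety": 3.1, "threats": -2.6}
print(compress_state(terms))  # (1.2, -0.5, 2.0, -2.0)
```

Clamping to [-2.0, 2.0] with tenth-size steps caps each component at 41 possible values, which is what makes the compressed state space tractable.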
My next steps will involve using a large database (like Lichess) to construct a comprehensive state space using the compressed board positions. Not only can I use the optimal moves in each position (which are already given in many of the games), I can also use actual chess players' games to calculate agent trajectories, instead of having to construct a virtual agent and sample from it. There are a lot of open questions about how viable this approach will be in terms of storage space and runtime complexity. I'm also still a bit unsure how rewards will be represented in this space. I'll have to take a deeper look into the original paper's code repository, and I'll also schedule another meeting with CJ soon to discuss more.
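One way the trajectory-collection step could look is folding per-game sequences of (compressed state, action) pairs into a single state space that records which moves were played from each state. The states and moves below are invented placeholders – in practice each state tuple would come from the compression step and each action from a move actually played in a database game:

```python
from collections import defaultdict

def build_state_space(trajectories):
    """Map each compressed state to counts of the actions taken from it.

    trajectories: iterable of games, each a list of (state_tuple, move) pairs.
    """
    state_space = defaultdict(lambda: defaultdict(int))
    for trajectory in trajectories:
        for state, action in trajectory:
            state_space[state][action] += 1
    return state_space

# Two toy "games" sharing one compressed state (all values made up):
games = [
    [((0.3, 0.1, -0.2, 0.0), "e4"), ((0.5, 0.2, -0.1, 0.1), "Nf3")],
    [((0.3, 0.1, -0.2, 0.0), "d4")],
]
space = build_state_space(games)
print(dict(space[(0.3, 0.1, -0.2, 0.0)]))  # {'e4': 1, 'd4': 1}
```

Even this simple structure hints at the storage question: the map grows with every distinct compressed state seen across the database.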
See you all next week!