This is a side project to test how sensitive LLMs that play chess are to non-semantic factors (other than the position on the board). Currently it only supports the OpenAI text completion API, and has only been tested on GPT-3.5-turbo-instruct.
Adapted from Nicholas Carlini's repo and blogpost.
LLMs that play chess (currently mostly GPT-3.5-turbo-instruct) currently take the game state as a PGN, and predict the next move. For example, to respond to a Sicilian , one can prompt the model with
[White "Garry Kasparov"]
[Black "Magnus Carlsen"]
[Result "1/2-1/2"]
[WhiteElo "2800"]
[BlackElo "2800"]
1. e4 c5 2.
and if it predicts Nf3
, it means the engine should return the move Nf3
.
Turns out this engine can play chess pretty well (beating the 1800-rated Stockfish handily),
and model the board state extremely well, almost never making illegal moves.
(It helps to put strong players in the PGN header, because we can ask for generalization.)
The fact that the model does not take the current position, but rather the entire sequence of moves, as input, has interesting implications: the model's "thinking" in a given position might vary significantly based on features beyond the current board state.
For example, what does the model play for White in the position 2rq1rk1/1p2ppbp/p1npb1p1/8/2P1P3/1PN5/P2BBPPP/2RQ1RK1 w - - 0 1
here?
Turns out, it depends on how you arrive to it. If the moves are
1. e4 c5 2. Nf3 Nc6 3. d4 cxd4 4. Nxd4 g6 5. c4 Bg7 6. Be3 Nf6 7. Nc3 O-O 8. Be2 d6 9. Nf3 Ng4 10. Bd2 Qb6 11. O-O Nge5 12. Qb3 Qd8 13. Nxe5 Nxe5 14. Qc2 Be6 15. b3 Rc8 16. Rac1 a6 17. Qd1 Nc6
you get Be3
; whereas if the moves are
1. d4 Nf6 2. c4 g6 3. Nc3 Bg7 4. e4 d6 5. Nf3 c5 6. Be2 cxd4 7. Qxd4 Nc6 8. Qd1 O-O 9. O-O Nd7 10. Bd2 Nce5 11. Nxe5 Nxe5 12. Qc2 Be6 13. b3 Rc8 14. Rac1 a6 15. Qd1 Nc6
you get Nd5
.
The probabilities aren't even close. Both games are pretty natural, maybe not top-level, but not obviously bad.
There is a tool called proofgame in the texel repo that returns a PGN for a given FEN.
We use the option -f
and set the last two entires of the FEN to 0 1
, because we only care about the position, not the number of moves it took to get there.
The games it constructs are quite unnatural, and are usually a bit longer than the original game that led to the position.
As in Carlini's repo, we test the model on puzzles and report the accuracy, only here we report both the accuracy on the original game that led to a puzzle, and the accuracy on the constructed game that led to the same position.
generate_pgn_puzzles.py
to download a bunch of puzzles and corresponding games from Lichess.pgn_to_fen.py
to get a nice file of FENs.proofgame.py
to generate proof games. Depending on the size of the dataset, this is somewhat slow.make_pairs_puzzles_dataset.py
to generate a dataset of puzzles and their solutions.puzzle_pair_solver.py
to compare how the model performs.(The above is complicated because it's WIP, but it works. Eventually it should be a single script.)
This project has minimal dependencies so far: python-chess, litellm, cachier (for caching responses, it's possible to turn it off).
pip install -r requirements.txt
The texelutil binary (version 1.10) that works on x86-64 Ubuntu is included in this repo.
If you are on a different platform, you can get it from here.
In some cases you will need to compile it from scratch, which is easy; just follow the instructions in the texel
repo.
Put your key in a file called OPENAI_API_KEY
.
Humans, certainly. Players may adjust their strategies based on whether their opponent is playing well or poorly. However, I would bet the effect is more severe in current LLMs, even when accounting for player rating: players of similar strength as GPT-3.5-turbo-instruct should be affected much less.
As for other engines, technically yes, but in practice not that much. Stockfish caches some previously explored states during the game, which informs its initial search each time a new move is played. Nonetheless, given sufficient time to search, it should explore a similar set of positions regardless of how it arrived at the current one.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.