What is Complexity?

Chess engine · Software Development
If we all knew, we would all be masters.

In May 2020, an interesting proposal was put forward.

I provided some constructive criticism on the research paper A Metric of Chess Complexity by FM David Peng, as well as on the codebase used to validate its experiment. For many months I have refrained from further comment, and although the code has not progressed, two things have:
1. Public interest in "complexity" as determined by ACPL (yuck).
2. Lichess has a blogging platform where I can properly address deficiencies in the research method and control the conversation which I start.

... so the time has come for me to share my remaining criticisms of this ambitious project. Previously I had shared only the easier-to-address criticisms, while privately passing Peng's suggestion along to the Lichess team.

The Golden Goose

"Such a feature has the potential to revolutionize chess and would be invaluable to any chess website. Some specific applications include generating non-tactical puzzles (imagine tactics trainer for positional chess puzzles), creating chess computers that play with human personalities, and identifying concepts that are key to improvement at any rating level."

Science is a window for us to learn more about the world around us. Marketing is about selling ideas to an audience. This statement, if true, would have already garnered interest from both scientists and business people, who with a modicum of effort could develop and sell products based upon it. Further, if true, it could also inspire a black market of cheating software that helps players gauge the risk of cheating in particular positions. Peng's paper makes many promises similar to the one above, which raises the level of scrutiny I bring to the rest of the paper.

Propositions

This paper defines complexity through two different propositions:
a) Complexity is a 1-dimensional, transferable (teachable to a neural network) metric based upon centipawn loss as determined by some version(s) of Stockfish with or without a neural network.
b) By definition, complexity can be used in real time to determine how difficult a position is.
While some people's intuitions may support the notion that these propositions support or complement each other, I am unconvinced; regardless, it takes more than intuition to create useful tools.
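
To make proposition (a) concrete, here is a minimal sketch (my own illustration, not Peng's code) of how centipawn loss and its average are typically computed from engine evaluations; the numbers in the example are invented.

```python
# Minimal sketch of average centipawn loss (ACPL), the quantity proposition (a) builds on.
# Evaluations are taken from the mover's perspective, in centipawns; values are illustrative only.

def centipawn_loss(eval_before: int, eval_after: int) -> int:
    """Loss incurred by one move: how far the mover's evaluation dropped (never negative)."""
    return max(0, eval_before - eval_after)

def average_centipawn_loss(move_evals: list[tuple[int, int]]) -> float:
    """ACPL over (eval_before, eval_after) pairs for one player's moves."""
    losses = [centipawn_loss(before, after) for before, after in move_evals]
    return sum(losses) / len(losses) if losses else 0.0

# Two moves losing 20 and 160 centipawns give an ACPL of 90.
print(average_centipawn_loss([(30, 10), (10, -150)]))  # 90.0
```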

Logic

Even if the above axioms were true, how many of these conclusions would be logically valid?
1. Non-tactical puzzles could be generated by identifying challenging positions (as opposed to the current method which is based upon positions where the solution is superior in evaluation to other variations).
2. The current model for puzzle ratings (based upon "elo") takes many attempts to establish an initial puzzle rating (see the rating-update sketch after this list).
3. Holistic opening preparation can be automated by software, making players understand openings rather than memorize them.
4. Interesting positions for books are the same as difficult positions, which are the same as complex positions.
5. By identifying positions which are difficult for low-rated players and easy for high-rated players, one could develop training materials to help players understand common key concepts.
6. By identifying positions which are difficult for low-rated players and easy for high-rated players, one could develop a diagnostic chess exam which identifies a player's rating and identifies key concepts for improvement.
7. Large databases contain additional tagged information, such as time control, which would yield significant insight into which positions can be played intuitively. Large databases also record player ratings, and therefore somehow complexity can be used to identify unique strategies for playing at a rating advantage or disadvantage.
8. Chess players have human personalities.
9. Opening systems can be devised around an opponent's tendency to seek or to avoid complexity.
10. Chess players are likely to make errors in difficult positions, unlike engines, and therefore a complexity metric would be an invaluable tool.
11. Spectating (and honestly, post-game analysis) of top chess games could be enriched by displaying complexity information related to each position, informing spectators who otherwise look at engine evaluations and variations & jump to conclusions.
12. Complexity varies by time control; for example, blitz and correspondence have different complexities for identical positions.
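
Regarding conclusion #2, a bare-bones Elo-style update (a simplification; Lichess's puzzle ratings actually use a Glicko-style system, and the seed rating and K-factor below are assumptions) shows why a freshly added puzzle needs many attempts before its rating settles near its true difficulty:

```python
def expected_score(puzzle_rating: float, player_rating: float) -> float:
    """Probability that the puzzle 'wins' (i.e. the player fails it), under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((player_rating - puzzle_rating) / 400.0))

def update_puzzle_rating(puzzle_rating: float, player_rating: float,
                         player_failed: bool, k: float = 32.0) -> float:
    """One Elo-style adjustment of the puzzle's rating after a single attempt."""
    score = 1.0 if player_failed else 0.0
    return puzzle_rating + k * (score - expected_score(puzzle_rating, player_rating))

# A puzzle seeded at 1500 drifts only a few dozen points per attempt toward its true
# difficulty, which is why many attempts are needed; a complexity metric could supply
# a better seed from the position alone.
rating = 1500.0
for player_rating, failed in [(1800, False), (1600, True), (2000, False), (1700, True)]:
    rating = update_puzzle_rating(rating, player_rating, failed)
print(round(rating))
```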

In my opinion, conclusion #11 is valid and the others require further research. Anyway, on to Peng's research...

Neural Networks

This paper nearly predates the efforts by DeepMind, the Leela Chess Zero team, and the Stockfish team which resulted in the development of Stockfish-NNUE. We could not have anticipated such rapid developments! Many chess players felt that AlphaZero and Leela played far more human-like moves than traditional engines, much as, decades earlier, world champion Kasparov was astounded that Deep Blue played human-like sacrifices. Whatever conclusions are drawn may need to be updated, since both classical Stockfish evaluations and Stockfish-NNUE evaluations have changed rapidly (alongside changes to Stockfish's search and search parameters).

Endgame Scaling

Stockfish evaluations are capped at 10 in the middlegame and at 100 in the endgame. As such, it seems unreasonable to deviate from prior research indicating the need for a sigmoid to normalize evaluations before classifying sample input moves as blunders.
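
As an illustration of that normalization (a sketch of the standard logistic mapping; the 400-centipawn scale and the 0.10 threshold are common conventions I am assuming, not values from Peng's paper):

```python
# Sketch: squash raw evaluations through a sigmoid before classifying blunders,
# so that the much larger endgame cap mentioned above cannot dwarf middlegame-scale evals.

def win_probability(centipawns: float) -> float:
    """Map a centipawn score onto (0, 1) via a logistic curve."""
    return 1.0 / (1.0 + 10 ** (-centipawns / 400.0))

def is_blunder(cp_before: float, cp_after: float, threshold: float = 0.10) -> bool:
    """Flag a move as a blunder when the mover's win probability drops by more than the threshold."""
    return win_probability(cp_before) - win_probability(cp_after) > threshold

print(win_probability(100), win_probability(1000))  # ~0.64 vs ~1.00: bounded, no longer 10x apart
print(is_blunder(50, -150))                         # True: a drop of roughly 0.27 in win probability
```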

Board Representation

Doing original research allows for liberties in methods and models, although the considerations offered here differ from those announced and discussed in public interviews by DeepMind's CEO. While I don't fully agree with DeepMind's emphasis on asymmetry and castling rights, I do question the need for an extra bit for White/Black to move. For example, the positions after 1. c3 e5 2. c4 (Black to move) and after 1. e4 c5 (White to move) should have the same relative evaluation.
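
A brief sketch of that point, assuming python-chess is installed (my own illustration, not taken from the paper): mirroring a position, which flips the board vertically and swaps colors including the side to move, collapses those two positions into one, so an extra side-to-move bit adds no information to a relative evaluation.

```python
import chess

# Position after 1. c3 e5 2. c4 (Black to move)...
board_a = chess.Board()
for san in ["c3", "e5", "c4"]:
    board_a.push_san(san)

# ...and the position after 1. e4 c5 (White to move).
board_b = chess.Board()
for san in ["e4", "c5"]:
    board_b.push_san(san)

# Board.mirror() flips the board vertically and swaps colors, castling rights, and the turn.
mirrored = board_a.mirror()
print(mirrored.board_fen() == board_b.board_fen())   # True: identical piece placement
print(mirrored.turn == board_b.turn == chess.WHITE)  # True: same side to move as well
```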

Evaluation Skewness

There is ample prior research about ranking moves. In fact, Peng's research here is predicated on the notion that traditional engines sometimes indicate to spectators that two moves are equally good, even though one results in a far more difficult position than the other. We cannot be fully certain that players in fact played the best moves, as this is the very concept we are trying to figure out how to measure! Regardless, we have to start somewhere, and this seems like a good first attempt.
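
A MultiPV query makes the phenomenon easy to see: two candidate moves can come back within a handful of centipawns of each other even when one leads to a far thornier position in practice. Below is a sketch using python-chess; the Stockfish path, the search depth, and the example position are my assumptions to adapt for your setup.

```python
import chess
import chess.engine

# An arbitrary early position (after 1. e4 e5 2. Nf3 Nc6 3. Bc4, Black to move).
board = chess.Board("r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3")

# "/usr/bin/stockfish" is an assumed path; point this at your own engine binary.
engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")
infos = engine.analyse(board, chess.engine.Limit(depth=18), multipv=2)
for info in infos:
    move = info["pv"][0]
    print(board.san(move), info["score"].relative)
engine.quit()
# The two lines may differ by only a few centipawns, yet lead to positions of very
# different practical difficulty -- exactly the gap a complexity metric aims to expose.
```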

Summary

I could nitpick further... I criticize because I am impressed and because I care about this subject. I am further impressed that the results were split by "elo," leading to the discovery that some positions are difficult for all players, whereas others are more difficult for lower-rated players than for higher-rated ones.

Other possible improvements could involve:
* Obtain segmented Stockfish evaluations (material, pawn structure, etc.) and WDL statistics
* Obtain Stockfish-NNUE evaluations and WDL predictions
* Model checks in sample input
* Model log(time remaining) in sample input (both sketched below)
* Maybe bootstrap models based upon known pawn concepts
* Maybe include some human versus engine games. Some bots such as Boris-Trapsky and TurtleBot have personalities!
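
As a rough sketch of what two of those extra inputs could look like as model features (the feature layout is my own invention, not something from Peng's paper; it assumes python-chess):

```python
import math
import chess

def extra_features(board: chess.Board, seconds_remaining: float) -> list[float]:
    """Two of the suggested extra inputs as plain numbers:
    whether the side to move is in check, and log(time remaining)."""
    return [
        1.0 if board.is_check() else 0.0,
        math.log(max(seconds_remaining, 1.0)),  # clamp so a flagged clock never yields log(0)
    ]

board = chess.Board()
board.push_san("e4")
print(extra_features(board, seconds_remaining=45.0))  # [0.0, ~3.81]
```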

Thanks for the suggestion, and someday I hope to see Lichess.org or some other site implement a complexity metric before cheaters do.


Photo credit: Pacto Visual