Adding a rating and time control filter to both sides will answer the more specific question of how suitable an opening is. It depends on how prepared your opponent is. The master games stats and stockfish eval are good enough for a quick look IMO.
@amback9 said in #2:
> Adding a rating and time control filter to both sides will answer the more specific question of how suitable an opening is. It depends on how prepared your opponent is. The master games stats and stockfish eval are good enough for a quick look IMO.
Yes, if you want to make the preparation very specific to the type of opponent, then a rating filter on both sides would be useful. The methodology I propose in the blog post essentially assumes that a 2100-rated player, for example, wants to prepare against the full distribution of opponents that similarly rated players typically face on lichess, without focusing on any subgroup of opponents.
To be honest, I have rarely looked at the master games stats, but I'll look into them. Until a few days ago I was only looking at the lichess games stats, until I figured out that I cannot rely on them so much.
Upon a bit of further thought, adding some controls on the opponent's rating range would indeed make the statistics better. For example, if a 2100 player wants to create a white repertoire, he should want to see statistics based on white players within a range around his rating (such as 2000-2200, or even narrower) AND opponents within a range (which could also be 2000-2200). That way we avoid small samples whose results may be skewed by a few games with a huge rating difference between the two players.
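A minimal sketch of what such a filter could look like, assuming the games sit in a pandas DataFrame with columns like white_elo, black_elo, opening and result (all of these names are my assumptions, not the blog post's):

```python
# Keep only games where BOTH players fall inside the same rating band,
# then compute white's average score per opening.
import pandas as pd

def win_rates_in_band(games: pd.DataFrame, low: int, high: int) -> pd.DataFrame:
    """White's score per opening, using only games where both ratings are in [low, high]."""
    band = games[games["white_elo"].between(low, high) &
                 games["black_elo"].between(low, high)]
    score = band["result"].map({"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0})
    return (band.assign(white_score=score)
                .groupby("opening")["white_score"]
                .agg(games="count", white_score="mean")
                .sort_values("games", ascending=False))

# e.g. win_rates_in_band(games, 2000, 2200) for a 2100-rated player's white repertoire
```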
Nice! The database user wants to know the causal effect that playing opening X has on their win rate, but the rating window improperly controls for rating, which acts as a statistical confounder: it influences both the choice of opening move and the win rate.
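To see how big that kind of bias can get, here is a toy simulation with made-up numbers (not data from the post): latent strength influences both which opening white picks and how the game ends, the opening itself has no effect at all, and yet the naive per-opening win rates come out clearly different.

```python
# Toy simulation: strength drives BOTH the choice of opening and the result,
# while the opening itself has zero causal effect. The naive per-opening
# win rates still differ.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

skill_white = rng.normal(1800, 200, n)
skill_black = rng.normal(1800, 200, n)

# Assumption: weaker players are more likely to pick the offbeat opening.
p_offbeat = 1 / (1 + np.exp((skill_white - 1800) / 100))
plays_offbeat = rng.random(n) < p_offbeat

# The outcome depends ONLY on the skill difference (Elo-style expected score).
p_white_wins = 1 / (1 + 10 ** (-(skill_white - skill_black) / 400))
white_wins = rng.random(n) < p_white_wins

print("offbeat  opening, white win rate:", round(white_wins[plays_offbeat].mean(), 3))   # well below 0.5
print("mainline opening, white win rate:", round(white_wins[~plays_offbeat].mean(), 3))  # well above 0.5
```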
Your constructed example is excellent. I'm shocked at how big the bias is in the real data, too.
I saw that you limited yourself to January 2017. I put together a dataset sampling 1% of Lichess human blitz games, stripping out clock & evaluation info: www.kaggle.com/datasets/naddleman/lichess-blitz-subsample
It's a lot smaller and easier to work with, and it should at least be representative of blitz trends.
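A rough sketch of pulling first moves, ratings and results out of the subsample with python-chess, assuming it is distributed as a plain PGN file with the usual headers (the file name below is just a placeholder):

```python
# Read the subsample game by game and collect (first move, ratings, result).
import chess.pgn

records = []
with open("lichess_blitz_subsample.pgn") as pgn:  # placeholder file name
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        moves = list(game.mainline_moves())
        if moves:
            records.append((
                game.board().san(moves[0]),
                game.headers.get("WhiteElo", "?"),
                game.headers.get("BlackElo", "?"),
                game.headers.get("Result", "*"),
            ))

print(records[:5])
```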
These sorts of opening questions are fraught with statistical pitfalls. If you look at the win-rate of the Grob, controlling for rating doesn't work because a 2000 who plays the Grob is someone capable of reaching 2000 playing 1. g4!
n1000, this subsample is a really good idea! Thanks for creating it and for pointing it out! I'll try to run exactly the same analysis on the subsample tomorrow or the day after. The truth is that the 2017 data are a bit outdated, because in the last 7 years there have been new trends, people have studied openings that were played less in the past, etc. So a representative sample that includes recent data is very useful.
You are touching on a problem which is much more difficult than the one I try to solve in my post. My idea was to replace grouping by the average rating of the two players with grouping by white's rating (for white's point of view) and black's rating (for black's point of view), or perhaps by *both*, as amback9 suggests.
This removes *some* bias, but leaves another elephant in the room, which is exactly what you are talking about. I'd put it this way: the methodology presented in the post will help us answer the question "what is the win rate of 2000-2200 white players when they play 1.g4?", but does not attempt to answer "what is the causal effect of playing 1.g4 on the win rate of a 2000 white player?". In other words, I'm trying to improve the descriptive statistics and shy away from causal analysis.
I have had some vague thoughts about how one could possibly isolate a causal effect. Say I regress win rate on playing 1.g4; I run into the problem that playing g4 depends on how good a player is (I'm not sure which way this goes). Say I add a control for rating; then I run into exactly what you described in your last sentence. It seems like an impossible problem to solve.
I thought for a moment that perhaps chess960 ratings could be used as a control for general quality of play outside the opening phase, but I need to think about it more. Could an analysis of chess960 players when they play standard chess run into external validity issues? Difficult questions.
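For completeness, the regression mentioned above is mechanically easy to run; the sketch below fits a logistic regression of "white wins" on a 1.g4 indicator plus rating controls, on synthetic data with made-up column names. As discussed, conditioning on rating is itself part of the problem, so this only illustrates the mechanics, not a way to recover the causal effect.

```python
# Sketch only: logistic regression of "white wins" on a 1.g4 indicator plus
# rating controls, fitted on synthetic data (all numbers are made up).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
games = pd.DataFrame({
    "white_elo": rng.normal(1800, 200, n),
    "black_elo": rng.normal(1800, 200, n),
    "plays_g4": (rng.random(n) < 0.05).astype(int),  # assumed 5% of games start 1.g4
})
p_win = 1 / (1 + 10 ** (-(games["white_elo"] - games["black_elo"]) / 400))
games["white_win"] = (rng.random(n) < p_win).astype(int)

model = smf.logit("white_win ~ plays_g4 + white_elo + black_elo", data=games).fit()
print(model.summary())
```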
IMHO there is one more reason to focus on games where both players are within a certain range, rather than on their average rating or on the rating of just one of them: the bigger the rating difference, the smaller the role the opening plays, and the more the result is simply a consequence of the strength difference. With a rating difference of 400 or more, the stronger player can choose essentially any opening, except for complete blunders, without harming their chances significantly.
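For a rough sense of the last point, the standard Elo expected-score formula (lichess actually uses Glicko-2, but the shape of the curve is similar) already makes it concrete:

```python
# Expected score for the stronger player as a function of the rating difference.
def expected_score(rating_diff: float) -> float:
    return 1 / (1 + 10 ** (-rating_diff / 400))

for diff in (0, 100, 200, 400, 600):
    print(diff, round(expected_score(diff), 2))
# 0 -> 0.5, 100 -> 0.64, 200 -> 0.76, 400 -> 0.91, 600 -> 0.97
```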
@n1000 said in #5:
> If you look at the win-rate of the Grob, controlling for rating doesn't work because a 2000 who plays the Grob is someone capable of reaching 2000 playing 1. g4!
More often it will be someone who plays the Grob only against weaker opponents, knowing that they are less likely to be able to deal with it properly, and that even if they do, he/she should still get enough opportunities to turn the game around later.
I'm an amateur in statistics, but I really want to thank you for a well-written and comprehensible report. Much appreciated!
@D2D4C2C4 said in #1:
> Comments on lichess.org/@/d2d4c2c4/blog/why-opening-statistics-are-wrong/VKNZ1oKw
A month ago I wrote an article on my blog about how to read opening statistics; see schaken-brabo.blogspot.com/2024/08/statistiek-deel-2.html
You will see that I don't use the win rate at all.
I use 3 parameters to decide which move gets my preference:
- Number of times a move has been played
- The average rating of the players for a selected move
- How much rating players gain or lose for a selected move
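A rough sketch of how these three parameters could be computed per candidate move with pandas; the column names are my assumptions, not taken from the linked article:

```python
# Assumes a DataFrame with one row per game reaching the position, containing
# the move chosen, the mover's rating, and the mover's rating change from that game.
import pandas as pd

def move_parameters(position_games: pd.DataFrame) -> pd.DataFrame:
    return (position_games
            .groupby("move")
            .agg(times_played=("mover_rating", "size"),
                 avg_rating=("mover_rating", "mean"),
                 avg_rating_change=("mover_rating_change", "mean"))
            .sort_values("times_played", ascending=False))
```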
The idea of linking the first moves of a game to win rates in order to find out how good they are in practical terms is rather bold. Just a random example from my own win rates (without bullet): I score 50/7/43 in the position after 1.f3 e5 2.g3 d5 3.Nh3, but only 48/10/42 after 1.d4 Nf6 2.c4 e6 3.Nf3. I played both positions around 650 times with white, so the samples aren't that small and are comparable. Now what can be concluded? That the 1.f3 line is more promising because of the slightly better rates? I would argue that I simply have more experience than my opponents in the first variation, and that's why I perform well there, and I would say your results show that too. Uncommon openings score better not because they are good and everyone should play them, but because they are uncommon, and the players who play them have an experience advantage over their opponents.
Edit: I should add that the average opponent ratings in the two positions I gave are also similar: 2484 and 2491
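For what it's worth, a simple two-proportion z-test shows how much of a two-percentage-point gap sampling noise alone could explain; the counts below are rough reconstructions from the percentages quoted above, so treat it purely as a sketch of the method:

```python
# Two-proportion z-test on the two win rates; the counts are approximate
# reconstructions of "50% vs 48% wins over roughly 650 games each".
from math import sqrt
from statistics import NormalDist

wins1, n1 = 325, 650   # ~50% in the 1.f3 line
wins2, n2 = 312, 650   # ~48% in the 1.d4 line

p1, p2 = wins1 / n1, wins2 / n2
p_pool = (wins1 + wins2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```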