
Expected Score of Grandmasters based on Evaluation

Well, if SF were to keep or open its innards to user control, one could actually have the centipawns right there in the simple evaluation function, and even restrict that machine champion to use only that evaluation at its leaves, to get such a notion.

It could also open up the NNUE function's input space, with its Elo "worth" coefficients (still measured in engine-tournament populations) made known. Then one would have lots of metrics with intelligible board sense (except for the piece-square tables, and all the search heuristics on top of any of those leaf-evaluation components I just asked to be opened to user control; not forgetting go-depth, so that we have a chess-board control rather than a hardware-dependent, non-shareable time-to-compute parameter control; a message to lichess in passing).

Centipawns would regain their meaning with simple-only evaluation, even in deep evaluations within the search tree. (Having access to the tree itself would probably also be a nice thing, in order to construct metrics of real human value applicable to any position.)
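To make the leaf-restriction idea concrete, here is a minimal sketch of a fixed-depth search whose leaves are pinned to a material-only simple evaluation, so every returned score stays in interpretable centipawns. It uses the python-chess library; the 1/3/3/5/9 piece values and the bare negamax routine are illustrative assumptions, not Stockfish's actual code.

```python
import chess

# Classic 1/3/3/5/9 material values in centipawns (an assumption;
# Stockfish's internal values differ).
PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
                chess.ROOK: 500, chess.QUEEN: 900}

def simple_eval(board: chess.Board) -> int:
    """Material-only 'simple evaluation', side-to-move perspective."""
    score = sum(value * (len(board.pieces(piece, chess.WHITE))
                         - len(board.pieces(piece, chess.BLACK)))
                for piece, value in PIECE_VALUES.items())
    return score if board.turn == chess.WHITE else -score

def negamax(board: chess.Board, depth: int) -> int:
    """Fixed-depth search with the leaf evaluation pinned to simple_eval,
    so every returned score is a plain centipawn material forecast."""
    if board.is_checkmate():
        return -10_000  # side to move is mated (not a 1/3/3/5/9 quantity)
    if depth == 0 or board.is_game_over():
        return simple_eval(board)
    best = -10**9
    for move in board.legal_moves:
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best
```

The depth parameter here is the go-depth knob mentioned above: a board-level control rather than a hardware-dependent time budget.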

Add the tree with NNUE, with its board-interpretable input components made part of the user-accessible output, and you could even have a further, deeper simple-evaluation predictor translated into the input-space components themselves, positionally interpretable and now contrastable with the other tree in simple-evaluation mode.

BTW, all the time people were thinking SF was playing positionally, it was only doing that for about 2% of its Elo score (something like that). This is the new history of SF, since they went back and made a clean separation between material counting (simple evaluation) and delegated the positional components to NNUE training (still based on simple evaluation, but at moderately greater depth, so the positional appearance is about converting static positional clues into predictors of even deeper material conversions). That is what people were seeing as positional play. It was emergent, and only about those clues that would eventually have material consequences, which mate is not really (well, not of the 1/3/3/5/9 material-counting kind).

So of course it makes sense to include the ratings of the 2 players in those conversion curves, but there might be a more complete multidimensional dissection of those usually wasted engine tools... if we drop the oracle obsession.
It's interesting to see how this relates to move number or a metric like how much material is left, because the computer evaluation can give +1, for example, in certain positions in the KID where, statistically speaking, Black is doing fine.
Did you consider distinguishing between material and positional advantage? Throughout the last few years I have seen many GM games with an only-positional advantage of, let's say, +3 which after 2 inaccurate or bad moves was close to 0.0 again. On the contrary, a piece up usually wins the game.
@zeitnotakrobat said in #14:
> Did you consider distinguishing between material and positional advantage? Throughout the last few years I have seen many GM games with an only-positional advantage of, let's say, +3 which after 2 inaccurate or bad moves was close to 0.0 again. On the contrary, a piece up usually wins the game.

I only took Stockfish's evaluation without any additional considerations. Distinguishing between these two types of advantage isn't as easy as it seems, because taking hanging pieces and simple tactics should probably also count as material advantages. But then one has to determine whether there is a tactic in the position and somehow define which tactics are simple enough to be counted as a material advantage.
In the past I used LC0's win, draw and loss probabilities to see how easy a position is to play. I plan on doing something similar with this data in the future and trying to quantify how difficult a position is for humans to play.
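For what it's worth, one crude first cut at the material/positional split, ignoring the tactics caveat entirely, would be to subtract the raw material balance from the engine score and call the remainder "positional". A sketch with python-chess, assuming a local Stockfish binary on the PATH; the piece values and function names are illustrative, not part of the author's method.

```python
import chess
import chess.engine

PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
                chess.ROOK: 500, chess.QUEEN: 900}

def material_balance(board: chess.Board) -> int:
    """White-minus-Black material in centipawns."""
    return sum(value * (len(board.pieces(piece, chess.WHITE))
                        - len(board.pieces(piece, chess.BLACK)))
               for piece, value in PIECE_VALUES.items())

def split_advantage(fen: str, depth: int = 20) -> tuple[int, int]:
    """Return (material, positional) in centipawns, where positional is
    simply engine_eval - material."""
    board = chess.Board(fen)
    # Assumes a Stockfish binary on the PATH; adjust the command as needed.
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
    total = info["score"].white().score(mate_score=10_000)
    material = material_balance(board)
    return material, total - material
```

By construction, hanging pieces and simple tactics land in the "positional" bucket here, which is exactly the objection raised above.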
@jk_182 said in #15:
> I only took Stockfish's evaluation without any additional considerations. Distinguishing between these two types of advantage isn't as easy as it seems, because taking hanging pieces and simple tactics should probably also count as material advantages. But then one has to determine whether there is a tactic in the position and somehow define which tactics are simple enough to be counted as a material advantage.
> In the past I used LC0's win, draw and loss probabilities to see how easy a position is to play. I plan on doing something similar with this data in the future and trying to quantify how difficult a position is for humans to play.
What if you just skip the future-tactics part and see how often GMs win when they have both a material advantage and an engine eval edge?
I wonder, if you look at the data near 0.0, does it flatten out? At least at engine level there is clearly a range near equality with an expected score of nearly 0.5. Here is an old SF example from their wiki.
user-images.githubusercontent.com/4202567/206895934-f861a6a8-2e60-4592-a8f1-89d9aad8dac4.png
If you need a better formula, maybe it would be worth looking at what SF uses for WDL: github.com/official-stockfish/WDL_model
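For reference, the WDL model in that repo fits a logistic win-rate curve whose location and sharpness drift with the game phase. A minimal sketch of that functional form; the linear dependence on material and the numbers below are placeholder assumptions one would refit on grandmaster games, not the repo's actual coefficients.

```python
import math

def win_prob(cp: float, material: int) -> float:
    """Logistic win-rate curve in the spirit of Stockfish's WDL model.
    a (location) and b (sharpness) drift with remaining material; the
    linear forms and constants are placeholders to be refit on GM data."""
    a = 100.0 + 2.0 * material   # placeholder: eval where P(win) = 0.5
    b = 60.0 + 1.5 * material    # placeholder: larger b = flatter curve
    return 1.0 / (1.0 + math.exp(-(cp - a) / b))
```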

@ranibow_ghost said in #12:
> It's interesting to see how this relates to move number or a metric like how much material is left, because the computer evaluation can give +1, for example, in certain positions in the KID where, statistically speaking, Black is doing fine.
Those are actually what SF previously used / currently uses for its own WDL output. Though ±1 itself doesn't really seem to move, the trend seems to be less confident for evals away from ±1 at higher material counts.
@Craftyawesome said in #17:
> I wonder, if you look at the data near 0.0, does it flatten out? At least at engine level there is clearly a range near equality with an expected score of nearly 0.5. Here is an old SF example from their wiki.
I didn't look closely at small advantages, since I think the data will get a lot more chaotic when dividing games into smaller bins, because the sample size goes down (the binning is sketched below). Also, since White starts with an advantage of around +0.2 in the opening, this alone might lead to a flattening of the curve from 0 to +0.2.
> If you need a better formula, maybe it would be worth looking at what SF uses for WDL: github.com/official-stockfish/WDL_model
I looked at the Stockfish model in the past, but since it's calibrated for engine chess, I don't think it's applicable to grandmaster play.
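For concreteness, the binning described above fits in a few lines; the function name, the samples format, and the bin_width default are illustrative choices, not the author's actual pipeline.

```python
from collections import defaultdict

def binned_expected_score(samples, bin_width=0.5):
    """samples: (eval_in_pawns, score) pairs, score being 1, 0.5, or 0
    from White's point of view. Returns {bin_center: (mean_score, n)}.
    Smaller bin_width gives finer resolution but fewer games per bin,
    hence the more chaotic data near 0.0 mentioned above."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ev, score in samples:
        center = round(ev / bin_width) * bin_width
        totals[center] += score
        counts[center] += 1
    return {b: (totals[b] / counts[b], counts[b]) for b in sorted(totals)}
```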
@jk_182 said in #18:
> I looked at the Stockfish model in the past, but since it's calibrated for engine chess, I don't think it's applicable to grandmaster play.
Yeah, I'm not saying use it with the same coefficients; those will be too sharp for humans. I'm just saying that if there is a notable flattening for drawish positions, you would need a different formula to model it.
Though I suppose the important thing SF does is model win rate instead of expected score.
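Concretely, once win and loss probabilities are modeled separately, the expected score falls out as P(win) + 0.5 * P(draw), and a draw zone produces exactly the plateau near equality discussed above. A minimal sketch with placeholder parameters, not fitted values:

```python
import math

def win_prob(cp: float, a: float = 150.0, b: float = 80.0) -> float:
    # Logistic win-rate curve; a and b are placeholder values, not fits.
    return 1.0 / (1.0 + math.exp(-(cp - a) / b))

def expected_score(cp: float) -> float:
    """P(win) + 0.5 * P(draw), taking P(loss) as the opponent's win
    probability at -cp. With a > 0 carving out a draw zone, the curve
    sits near 0.5 for small |cp| and only then steepens."""
    p_win, p_loss = win_prob(cp), win_prob(-cp)
    return p_win + 0.5 * (1.0 - p_win - p_loss)

# expected_score(0.0) == 0.5; the flattening shows up for small |cp|.
```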
As always, very interesting work. However, you should consider that your data may be biased in more than one way.
One has already been pointed out: time pressure.
A second one could be the "sharpness" of a position. By this I mean: in a +4 position where everything is winning, i.e. no counterplay and no blunder chances, GMs will resign/win 100% of the time; but in a +4 position where you have to play 4-5 consecutive perfect moves to convert, GMs will defend to the end / win less often.
Maybe there are other biases that would be worth considering, to filter the starting data and obtain a better sample.