Understanding Chess Engines' Evaluations

This article will help the reader understand how to correctly interpret chess engine evaluations.

Donations are much appreciated. All of my content and writings will be free forever according to the "Truthdoc code" (to be published some time next year) - help keep such content going strong :).


Chess engines: feared by all, truly benefiting only a few. In this article, I hope to change that. While advanced players, and especially professionals, make good use of engines, many amateur players misuse them, putting their chess progress in great jeopardy - in fact, I believe some amateurs are actively ruining their chess through such misuse. Seeing as how engine bars are becoming ubiquitous in chess commentary, this post needs to be read very carefully - now more than ever.

Note 1: As engines continue getting stronger, the exceptions I talk about in this blog post will become increasingly rare. In fact, I hope that engines improve so much that this blog post becomes obsolete! Actually, I wrote this blog post in 2015, and I had to update a lot of positions because of how strong engines have become - SF + NNUE is now a formidable beast whose evaluations can rarely be doubted.

Note 2: I exclusively used Stockfish 14 (NNUE) in the analysis so that evaluation comparisons would be much more accurate.

Note 3: This post was originally meant to be comprehensive, but I think that most chess players will be able to confirm my conclusions without me having to support them with over 50 experiments.

Note 4: The article needs to be read in its entirety because some concepts are clarified in multiple sections. I really hope you enjoy the content of this article!

Note 5: Please read the take-home message at the end of the article.

Understand What They Want

Besides wanting to participate in some sort of Terminator-like apocalypse, chess engines want to play the best moves. However, despite these machines' brilliance, they are not intelligent (yet?!), so there must be a way for us to know what exactly the machine is thinking.

Enter the concept of evaluations. The ability to make precise measurements is essential in science, and the evaluation function, with its brilliant simplicity, offers precisely such a means of numerically judging a chess position.

Here's how it works: the computer, using its raw processing power, generates numerous possible future board positions. Once those positions are created, the computer needs to know which one is best; to do that, it assigns a specific number to each board position. This number tells us human players who is better and by how much. The engine then picks the move leading to the best-scoring position for the side to move, maximizing its chances to win.
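
To make this concrete, here is a minimal sketch of the search-and-evaluate loop described above, using the python-chess library and a toy material-only evaluation. This is an illustration, not how Stockfish actually works - real engines add positional terms, NNUE networks, pruning, and much more.

```python
# A minimal sketch of "generate positions, score them, pick the best,"
# using the python-chess library and a toy material-only evaluation.
# Real engines add positional terms, pruning, and much more.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def evaluate(board: chess.Board) -> float:
    """Assign a number to a position: '+' favors White, '-' favors Black."""
    score = 0.0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def negamax(board: chess.Board, depth: int) -> float:
    """Search future positions and back up the best score for the side to move."""
    if depth == 0 or board.is_game_over():
        sign = 1 if board.turn == chess.WHITE else -1
        return sign * evaluate(board)
    best = -float("inf")
    for move in board.legal_moves:
        board.push(move)                              # create a future position...
        best = max(best, -negamax(board, depth - 1))  # ...and score it
        board.pop()
    return best
```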

If the engine spits out a +3 evaluation, it thinks that White is ahead by three pawns (the "+" sign is used to denote that White is ahead, while the "-" sign is used to denote that Black is ahead). What does this mean? It means that whatever the position on the board, White is effectively up a minor piece*. Even if Black is actually up a minor piece, when the computer gives an evaluation like "+3," it thinks that White's initiative or long-term compensation is so strong that White is effectively up a minor piece.

Now, despite all their merits in helping humans assign a numerical value to positions, evaluations aren't perfect. I will explain my position with supporting evidence later, but for now, here are some important ideas concerning evaluations:

Generally, the interpretation of evaluations is as follows:

0 to +-0.30: Equal, or no side has any real winning chances - only slight propensities to win

+-0.31 to +-0.70: White / Black is slightly better, or White / Black has good chances to play for a win

+-0.71 to +-1.5: White / Black is (much) better, or White / Black should be able to win

Above +-1.5 (up to mate in X): White / Black is winning, or White / Black must win from the position in question

These are also the same interpretations used by ChessBase and other chess GUIs.

Important note: These interpretations are different for different engines and even for different versions of the same engine. Nowadays, the trend has been toward more agreement in terms of rough evaluations between the engines, but +1 on Komodo almost never means the same thing as +1 on Stockfish. The reason for this disagreement is that different engines have different evaluation functions: for example, in one engine's positional evaluation function, a knight on the rim might incur a -0.15 penalty, while for another engine, it might incur a -0.20 penalty. That being said, these rough evaluations are quite accurate and very helpful in general. Seeing as how Stockfish remains the most popular engine (hoorah for open source!), these interpretations are very Stockfish-centric.

I should also note that the new neural network chess engines make use of a probability score (70% chance that White will win, for example), though that's usually converted to the standard numerical evaluations.
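
For illustration, here is a sketch of one such logistic conversion between a win probability and a pawn-unit evaluation. The scaling constant K is an assumption for illustration only - every engine and GUI fits its own constants.

```python
# A sketch of a logistic mapping between win probability and a pawn-unit
# evaluation. The constant K is illustrative, not any engine's actual value.
import math

K = 4.0  # assumed: ~4 pawns of advantage corresponds to ~90% winning chances

def eval_to_win_prob(e: float) -> float:
    """Pawn-unit evaluation -> win probability for the favored side."""
    return 1.0 / (1.0 + 10.0 ** (-e / K))

def win_prob_to_eval(p: float) -> float:
    """Win probability (0..1, exclusive) -> pawn-unit evaluation."""
    return K * math.log10(p / (1.0 - p))

print(eval_to_win_prob(3.0))  # +3 maps to roughly 85% under these constants
```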

Equal Positions

Evaluations in the realm of + or - 0.3 aren't too helpful. They simply aren't that precise; such positions are considered "equal," as in, closer to 0.00 than not. Still, the "equal" label isn't the whole story - a +0.3 position does reveal something: White is potentially better.

Usually, in a +0.3 position, White has more room for error than Black. A very small inaccuracy could make a +0.3 position for White a +0.1 position - still equal. If Black makes a small inaccuracy in a +0.3 position, though, the coming position might be a +0.5 position - White would then be slightly better.

However, it is very important to know when the evaluations are taking place: +-0.3 in the opening doesn't mean much, but in the middlegame, such an evaluation is a decently strong predictor of which side will eventually be better and which side has more leeway for errors.

Consider the Caro-Kann Defense:

https://lichess.org/study/S8wAvJ46/MSWeb6YG#2

At depth 55, Stockfish 14 gives White a +0.1 advantage. That doesn't mean that the Caro-Kann is a forced draw, or that White's tiny advantage gives them the better chances. The Caro-Kann is just a very good opening that's playable for both sides and for humans and engines alike! Therefore, don't read too much into opening evaluations.

For the middlegame, consider the following position:

https://lichess.org/study/S8wAvJ46/8VDfWGLM#21

Here, the evaluation is equal, or +0.3. However, seeing how in practice White's winning percentage is much higher than Black's (with, at the time of writing, the drawing percentage being equal to both combined - 50% draws, 34% White wins, 16% Black wins), this evaluation doesn't simply mean staid equality. It means that the position is equal, but White has good chances, or at least more chances than Black, to play for a win. For Black to play for a win, they must first achieve actual equality (0.00) and then start steering the game in their favor.

As for the endgame, +-0.3 is basically completely meaningless.

https://lichess.org/study/S8wAvJ46/5DqwUN2f#50

Does anyone really believe that White has a slight edge here (+0.3, depth 36, Stockfish 14), or any real winning chances? Of course not! This is a completely drawn position - the opposite-colored bishops should dispel any doubts about White's potential "winning" chances.

Slightly Better Positions

For slightly better positions (+-0.31 to +-0.70), the evaluations are usually quite accurate.

In the opening, "slightly better" evaluations can help differentiate between "professional" and "amateur" openings. Most mainline openings played by professional players are in the "equal" range. Of course, they don't give immediate equality for Black, else White would not play them; at the same time, though, they don't give White a huge advantage, else Black would not play them.

Consider the Ruy Lopez, for example:

https://lichess.org/study/S8wAvJ46/wTm2opnC#5

The Ruy Lopez is one of the most popular openings, especially at the top level. At depth 56, Stockfish 14 gives a +0.2 evaluation - "equality." On the other hand, let's consider:

https://lichess.org/study/S8wAvJ46/ri4jMtAT#2

This is the Nimzowitsch Defense, a reasonable opening for Black, but one that is almost completely absent in top players' games. SF 14 at depth 50 gives a +0.4 evaluation - slightly better for White.

In a similar vein, in the middlegame, any evaluation of "slightly better" is usually quite correct.

Take the following position:

https://lichess.org/study/S8wAvJ46/aNDTlvmn

At depth 42, SF 14 gives a +0.59 evaluation. For a human, it is easy to see why that is the case: White is up the exchange, but as compensation, Black can eye White's exposed king and potentially doubled pawns on the c-file. All things considered, White should be slightly ahead.

In contrast, in the endgame, positions engines claim are slightly better are a bit touch-and-go.

https://lichess.org/study/S8wAvJ46/tPa54Ca4

Here, SF 14 at depth 50 gives a -0.37 evaluation. Does this mean that Black is slightly better in the sense that they can convert such an advantage to a win? In human play, probably, but I think that objectively speaking, this position is drawn. Of course, to get to the truth of this position, one has to conduct very lengthy analysis, and I leave this exercise to the ever chess-hungry reader!

Much Better Positions

In the opening, +-0.71 to +-1.5 evaluations are reasonably accurate at predicting which side has a close to decisive advantage.

https://lichess.org/study/S8wAvJ46/7zY618kz#6

The Stafford Gambit?! Yes, at depth 52, SF 14 gives White a +1.4 advantage. This is not at all surprising because it is unclear what Black's compensation for the pawn is (or why they should be better in the first place). The power of the Stafford Gambit lies in its surprise value, not its objective strength. Therefore, SF's +1.4 can be seen as being "correct" - White is much better.

Now take a look at the following opening:

https://lichess.org/study/S8wAvJ46/yo5jPHIB#9

I may be introducing some bias by including an opening I play, but there's good reason for my inclusion of this opening. The Czech Benoni leads to a very closed position which is difficult for engines, neural engines or not, to evaluate. Also, engines like SF are known to overestimate a space advantage, one which White has from the get-go in this opening. Therefore, it is far less likely that the "much better" evaluation SF gives White against the Czech Benoni is correct.

As for the middlegame, unless the position is sufficiently complex, evaluations in the range of +-0.71 to +-1.5 are quite accurate.

https://lichess.org/study/S8wAvJ46/XAWjvdO8#29

White has a very powerful knight on c6, controls Black's counterplay on the queenside, and can open the center and kingside to their advantage at a moment's notice. Therefore, White is definitely much better and should go on to win the game.

However, in very complex middlegames, a +-1 evaluation could be meaningless when some forced sequence leads to an objectively drawn position (one that the computer misevaluates as +-1).

https://lichess.org/study/S8wAvJ46/ObhUnmSu#28

This is a position that occurred between Stockfish and Komodo in the TCEC Season 5 Superfinal. I have deeply analyzed this position and the game in its entirety (and will publish the analysis in a later blog post!), and I can confidently say that SF 14's +1.43 at depth 39 does not mean much - after the dust settles, Black will reach a rather drawish position. I've included the forced line - the reader is encouraged to discover why it is forced.

As for the endgame, computers evaluate many drawn positions as +-1 because they basically just count the material, failing to see that there is no way to make progress. For example, some completely drawn opposite-colored bishop endgames where one side is ahead by a pawn are evaluated as +-1, but it is clear to any human player that the game will most certainly end in a draw. Having said that, the new neural network-boosted versions of Stockfish give a much smaller advantage in these types of opposite-colored bishop endgames and actually reach the "stability point" very quickly (more on that later). Of course, the "endgame effect" does not solely occur in overly simplified positions. Endgames are also breeding grounds for fortresses, so without human assessment, an engine would consider many endgame positions to be slightly better for one side, but reality would reveal a draw as assured as a draw arising from stalemate.

Let's look at an example:

https://lichess.org/study/S8wAvJ46/5XQ87Pqj#0

At depth 44, SF 14 gives a +0.75 evaluation. In fact, this position is a tablebase draw!
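
Readers who want to verify such claims themselves can probe Syzygy tablebases directly. Here is a sketch using python-chess, assuming you have downloaded tablebase files locally (the path below is hypothetical, and the position is a simple example of mine, not the one from the study):

```python
# A sketch of probing Syzygy endgame tablebases with python-chess.
# Assumes tablebase files have been downloaded; the path is hypothetical.
import chess
import chess.syzygy

# Example position (mine): KB vs K, a trivial tablebase draw.
board = chess.Board("8/8/2k5/8/8/2K5/2B5/8 w - - 0 1")

with chess.syzygy.open_tablebase("/path/to/syzygy") as tablebase:
    wdl = tablebase.probe_wdl(board)  # +2 = win, 0 = draw, -2 = loss
    print("Tablebase verdict for the side to move:", wdl)  # -> 0
```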

However, these exceptions are rare, and "much better" in the endgame usually means a victory for the "much better" side.

Winning Positions

+-2 or +-3 evaluations generally don't suffer from the endgame effect and are basically quite accurate (with what the engine deems as "perfect play," White / Black should be able to win when given those evaluations). In some instances, +-2 or even +-3 evaluations suffer from the endgame effect, but those occurrences are extremely rare. Any higher than +-3 and the win is basically guaranteed. One could even argue for the strong hypothesis that if we presented a position with a +4 evaluation to two engines, giving the normal engine White and a "perfect play engine" (an engine that has solved chess) Black, the "weaker" engine should still be able to win even against perfect play. In the opening and middlegame, +-2 positions are basically always wins, with what the engine deems as perfect play, for the stronger side.

Let's look at a few examples:

https://lichess.org/study/S8wAvJ46/5hoiDtks#4

The Damiano Defense has long been known to be lost for Black, and, indeed, SF 14 at depth 43 gives a +1.6 evaluation.

https://lichess.org/study/S8wAvJ46/MAhuFCuK#2

Another opening that's known to be lost for Black: the Duras Gambit.

Now let's take a look at a Sicilian middlegame position:

https://lichess.org/study/S8wAvJ46/auqTNgCL#40

Black is ahead by a pawn, but SF 14 at depth 33 gives a +4.38 evaluation. The reason is that White has a monstrous attack; yes, material-wise, White is behind, but White's attack is so strong that in the future board positions the engine has analyzed, White will end up being effectively up a minor piece and a pawn (+4 = 4 pawns = ~a minor piece + pawn).

As for the endgame, let's look at this position:

https://lichess.org/study/S8wAvJ46/P9ZsFpJV#98

Here, White should be ahead by ~1.5 pawns (2 bishops = ~6 vs. 1 rook = ~5, with the bishop pair worth ~+0.5). However, SF 14 at depth 34 gives a +10.10 evaluation. The reason is that all of White's pieces, especially White's king and d6-bishop, are very strongly placed; as the game showed, Black's pawns were picked off one by one and White eventually won the game. For those interested in the game continuation, I have included it in the study (Ivan Zemlyanskii vs. Ashvath Mittal).

Again, in these "winning" positions, almost always, we can trust the engine's evaluation. Some counterexamples will be included in the second section of this article.

The Stability Point

To better understand computer evaluations, I came up with something I call the "stability point." The meaning is simple enough: there comes a point where the engine's evaluation simply isn't "improving"; at that point, the engine's evaluation is said to be stable. If the engine spends more time, achieving higher depths, and still gives out approximately the same evaluation as at the previous depth, the engine is at its stability point, and further increases in time spent analyzing the position won't be very beneficial. Let's consider an example:

Say you have a certain position you want to analyze and the engine gives an evaluation of +1.2 at depth 25. At depth 27, however, it gives an evaluation of +1.8. At depth 31, it even gives an evaluation of +4.2! At depth 33, though, it gives an evaluation of +4.3. At depth 34, it's back at +4.2. Thus, we can say that at that point in time, the engine's evaluations have stabilized. It is rare for an engine to actually change its evaluations much after it has reached a ~3 ply difference of stability (that is, its evaluations have remained almost constant for over three increases in ply count), so most analysis past the point of stability is effectively useless.
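
The idea is easy to mechanize. Here is a sketch of a stability check over a list of (depth, evaluation) pairs; the tolerance, the window size, and the extra data point extending the example above are all assumptions for illustration:

```python
# A sketch of detecting the "stability point" from (depth, eval) pairs.
# The tolerance and window values are illustrative assumptions.
def find_stability_point(evals, tolerance=0.15, window=3):
    """evals: (depth, score) pairs in increasing-depth order.
    Returns the first depth after which the score stays put, else None."""
    for i in range(len(evals) - window):
        base = evals[i][1]
        if all(abs(score - base) <= tolerance
               for _, score in evals[i + 1 : i + 1 + window]):
            return evals[i][0]
    return None

# The example from the text (with an extra illustrative depth appended):
history = [(25, 1.2), (27, 1.8), (31, 4.2), (33, 4.3), (34, 4.2), (36, 4.25)]
print(find_stability_point(history))  # -> 31: stable from depth 31 onward
```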

I should also add that there is a "symbolic evaluation" at the stability point. If the engine gives +0.23 at depth 50 and +0.23 at depth 70, the position is likely drawn and the evaluation at the stability point is said to be a symbolic evaluation.

Important note: Some evaluations might stabilize at very low depths but later change at higher depths. That is why the stability point cannot be considered at very low depths or at depths which are reached in a very short period of time (say, under 5 minutes). The stability point occurs when the engine's evaluations have stabilized after running for a reasonable amount of time (5+ minutes on any hardware).

Now for a more concrete example:

https://lichess.org/study/S8wAvJ46/b9qyxNEn#0

Look at the following evaluations by SF 14 (depth /// evaluation /// best move):

Depth 39 /// -0.57 /// Ng4
Depth 46 /// -1.06 /// Ng4
Depth 53 /// -1.07 /// Ng4
Depth 56 /// -1.07 /// Ng4
Depth 58 /// -1.07 /// Ng4
Depth 60 /// -1.07 /// Ng4
Depth 62 /// -1.08 /// Ng4
Depth 72 /// -1.08 /// Ng4

At lower depths, SF's evaluations hadn't yet stabilized and further depths needed to be reached. However, at depth 53, it started becoming clear that SF was close to its stability point for this position, and further increases in depth only confirmed this hypothesis.

What are some other implications of the "stability point"? For the opening analysts out there, that means that having an engine analyze a certain position when its evaluations have already stabilized is pretty much useless - that's as far as the engine will see. The rare cases when the engine does see differences in its evaluation past the stability point do not warrant the very high amount of time needed to run the engine past its stability point.

If you're not satisfied with the engine's evaluations at the stability point, you either need to improve the engine's evaluation algorithms (get a stronger engine, or contribute to / make one yourself!), or you need drastic improvements in hardware that can help you achieve much higher depths.

*Bishops and knights are traditionally equal to three pawns, but these values are obviously not set in stone; Larry Kaufman, in his brilliant article, "The Evaluation of Material Imbalances in Chess," gives the minor pieces a value of 3.25.

A Win? - Why Computers are Liars

We've touched on this subject in the previous section. Basically, certain evaluations which give an advantage to either side may actually not be accurate at all.

Example 1 (Stockfish 14)

Let's see an exception to our rule (the "+-3 rule," which states that evaluations above +-3 are basically always winning):

https://lichess.org/study/S8wAvJ46/zgHYesGj#1

In this position, White, a former world #3, Salov, constructs a fortress that stops Korchnoi's advances. At depth 49, though, SF 14 gives a -5.45 evaluation - completely winning for Black? No! This is a fortress position in which Black cannot make any progress. All White has to do is keep the rook on the 5th rank and be careful not to allow a liquidation into a pawn endgame that would be winning for Black were Black to sacrifice their queen.

Also, unfortunately, the engine never reaches a stable evaluation, continuously shifting between -4's and -5's.

Example 2 (Stockfish 14)

https://lichess.org/study/S8wAvJ46/PuNeSxFC#0

-11.50 at depth 42? Preposterous! The position is completely closed and simply drawn.

Example 3 (Stockfish 14)

https://lichess.org/study/S8wAvJ46/VVYaaedi#0

How could I not include this?! Also, allow me to advertise my own study :)

https://lichess.org/study/fmDBF2kw

Instead of misevaluating a winning position as drawn, SF misevaluates which side is winning!

Again, the exceptions to the +-3 rule are quite rare. Furthermore, fortress positions are pretty much never given the correct evaluation by the engines - a +5 evaluation could very well mean that White has no way of making any progress. In such endgames, the stability point would make it clear whether there's a win or not (say, if the evaluation remains at ~+4.94 for over 30 increases in ply count).

I encourage the readers to present the most difficult-to-crack and unbelievable fortresses they know of!

Depth Woes - Searching in the Mariana Trench

Arguably, the "depth matters" concept is one of the most important pieces of information you need to know about chess engines. There comes a point where increasing depth leads to diminishing returns, though, but on average PC hardware, it takes lots of time for the evaluation of a position to stabilize. Therefore, if you're doing some analysis on your computer, more often than not, it's best to let the engine run for quite some time for it to come as close to possible to the "truth" of the position.

Consider the following position:

https://lichess.org/study/S8wAvJ46/CjOGn7AM#0

Depth 26 /// +5.68 /// Rb7

SF 14 initially considers the incorrect Rb7, but only at depth 44 does it find...a8=B! For more information, check out Tim Krabbé's brilliant blog: https://timkr.home.xs4all.nl/chess2/minor.htm

The PV Breakdown

The principal variation (PV) is the sequence of moves that the engine thinks is best for both sides. PVs in endgames are helpful, but in middlegames, they tend to "break down."

Again, remember how engines work. Engines are evaluating future board positions. Therefore, it is to be expected that a move that is suggested in the engine's long PV will not always be the best move. If the engine is analyzing a position at move 10 and gives a 20-move line, move 29 of the line will probably not be the best move because then the engine might have to evaluate positions another 20 moves ahead to determine the best move on move 29!
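
One way to expose a PV breakdown is to step through the line and re-analyse each position from scratch, flagging moves the engine itself no longer prefers. A sketch, again assuming python-chess and a local Stockfish binary (the function name and depth are mine):

```python
# A sketch of checking a PV for "breakdown": replay the line and re-analyse
# each position fresh, flagging moves the engine no longer considers best.
# Assumes python-chess and a local Stockfish binary on your PATH.
import chess
import chess.engine

def check_pv(fen: str, pv_san: list[str], depth: int = 25) -> None:
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        for san in pv_san:
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            best = board.san(info["pv"][0])
            if best != san:
                print(f"Move {board.fullmove_number}: PV plays {san}, "
                      f"but a fresh search prefers {best}")
            board.push_san(san)  # follow the original PV regardless
```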

Here's a game between Fabiano Caruana and Magnus Carlsen that was played during Fabiano's amazing run in the Sinquefield Cup in 2014. I chose this game because it's probably too complex for even a human + computer pair to completely understand (and this actually speaks volumes about the complexity of chess since compared to incomprehensible games like Ivanchuk - Yusupov, 1991, the Caruana - Carlsen game is a piece of cake), and it highlights some very important ideas that one should be aware of when looking at the PVs that the engine gives.

Let's take a look at the PV that Stockfish gives for Black on his 22nd move (taken at depth 35):

https://lichess.org/study/S8wAvJ46/YEjzdSu4#43

There are many (minor) improvements along the way (moves 36 and 41, for example), at least according to SF 14 itself! However, the most flagrantly wrong move is move 43, the final move of the PV: the PV suggests 43. Kc4, which, after analysis with SF 14, isn't even among the top 3 best moves! These are the best 3 moves at depth 33 when analyzing move 43:

Depth 33 /// +3.93 /// a5
Depth 33 /// +2.95 /// Rb6
Depth 33 /// +2.02 /// Bh3

Expected! Don't fully trust the engine's PV!
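
If you want to pull such a top-three list yourself, engines expose this through MultiPV. A sketch with python-chess, assuming a local Stockfish binary:

```python
# A sketch of retrieving an engine's top three moves via MultiPV,
# using python-chess with a local Stockfish binary (assumed on PATH).
import chess
import chess.engine

board = chess.Board()  # set up the position under analysis here
with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    for info in engine.analyse(board, chess.engine.Limit(depth=33), multipv=3):
        print(info["multipv"], info["score"].white(), board.san(info["pv"][0]))
```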

Conclusion

Chess is a complex game, but fret not, chess lover! Humanity's unquenchable thirst for knowledge has produced some high-powered machines capable of aiding us in understanding our beloved game: it's up to you to make good use of these helpful friends, and I hope that my article has at least helped a bit in introducing you, my dear reader, to such concepts as "The PV Breakdown," "The Engine Endgame Effect," and the "Stability Point."

Take-Home Message: It is of prime importance not to treat the engine's evaluation as a standalone tool that correctly assesses the position. Beyond the bare number, one should look at how exactly the engine continues to play the position. If it gives +1 yet heads into a pawn-up rook endgame with extremely high drawing chances, its evaluation function is clearly more indicative of the material value on the board than of the proper end result of the game. Moreover, when going through an engine line to see if its evaluation makes sense, it is important to check whether the line it gives is actually best at every turn - sometimes, ten moves down the line, the engine will start giving some subpar moves.

Therefore, to correctly use an engine, one must use it alongside an opening database and an endgame tablebase, as well as with the proper mindset to a) let it reach sufficient depths, b) play through the lines it gives, and c) improve the lines it gives, lest they suffer from the "PV breakdown" issue already mentioned in the article.