SF Horde Strength

With me only taking black (and playing at a ~3+2 pace), it beat me 6-2. I didn't play particularly well, but in the last 2 losses I allowed myself some endgame takebacks and still couldn't win.
I won, and I really need to play 1... h5, or else after 1. a5 and 2. h5 I can't pierce the horde anymore (but that's just me being weak). I'd say it's 24xx, maybe not quite as good as in, say, zh.

I don't think I can beat it as white though (I lost twice and won't be really trying).

@FischyVishy The improvement relative to the baseline is similar for crazyhouse and horde, but the baselines are very different. SF crazyhouse defeated most of the top engines even before the first release, whereas SF horde started at a level of roughly 1500.

I mostly agree with @lecw's estimate. The playing strength of level 8 seems to be somewhere in the range of 2300-2500. The "perceived" playing strength might be lower, because if you find one opening where it plays bad moves, you can usually just repeat the moves to win most of the time.

The problem with horde chess for an engine is that horde is asymmetric and very strategic, so the evaluation is very difficult and very important at the same time. The difference in evaluation between moves is usually very small, and it only becomes apparent which move is better after a lot of moves, which makes it harder to selectively search only the good moves. In crazyhouse there is a much larger number of possible moves, but most of them can usually be discarded at very low depths, so the search is quickly guided in the right direction.

I already added several horde-specific evaluation terms, e.g. for the distance from breaking through to the first rank and for the imbalance of the horde (difference in the number of pawns on neighboring files). That helped a lot, but it still often seems to underestimate the importance of such factors.
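To make the "imbalance of the horde" idea concrete, here is a rough sketch that counts white pawns per file from a FEN and sums the differences between neighboring files. This is only an illustration of the description above, not the actual Stockfish term; the function names and the way the differences are summed are assumptions.

```python
def white_pawns_per_file(fen):
    """Count white pawns on files a..h from the board part of a FEN."""
    board = fen.split()[0]
    counts = [0] * 8
    for rank in board.split("/"):
        file_idx = 0
        for ch in rank:
            if ch.isdigit():
                file_idx += int(ch)  # skip empty squares
            else:
                if ch == "P":
                    counts[file_idx] += 1
                file_idx += 1
    return counts


def neighbor_file_imbalance(counts):
    """Hypothetical measure: sum of |difference| between adjacent files."""
    return sum(abs(a - b) for a, b in zip(counts, counts[1:]))


fen = "2bk4/5q1r/rPPP2p1/2PPP3/4PP1P/1PP2PP1/P1P2PPP/1P4PP b - - 0 46"
counts = white_pawns_per_file(fen)
print(counts, neighbor_file_imbalance(counts))
```

A real evaluation term would presumably weight this and combine it with other factors; the raw number is only meant to show what is being measured.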

FWIW after 3-1 I expected to be able to win more by repeating the line, but that just didn't happen. I almost had to invent a new win the second time around.

It very much didn't play the same opening every time, and even after a takeback it didn't play the same moves.

In a recent game it strikes me that SF evals the position after move 46 as only +2 (moderate white advantage?). It seems to me white's advantage is bigger than that, i.e. it will become crushing soon.

Relatedly, I find that games fairly often go from +2 to +11 or more in one move. Makes me wonder: what does a +6 position look like? I suspect they are hard to find, which suggests "small" white advantages should be scaled to larger values.

Maybe it just means that we humans are weak players of the black pieces (SF really has tremendous defensive resources with the pieces when the pawns are on ranks 6-7, as is to be expected from a comp), which is why we jump so straight from "position is a bit inferior" to "now you're doomed". What do you think?

Black does not seem to be able to stop white from promoting a pawn, but Stockfish still evaluates the position as far from decided, since the two rooks and the king are worth roughly as much as the remaining pawns. With the queens on the board it should be winning for white, I think. Playing through some variations, I also noticed that it often exchanges the promoted queen, which I think is usually not a good idea when playing white. It might be worth adding an evaluation term for the horde side to discourage it from exchanging queens. Some time ago I added a correction term that makes the evaluation drawish if white is down material but has a queen, but the case where white has a queen and is only slightly ahead in material is not well covered yet. I may test some ideas based on that in the next few days.

Here are the static evaluations of the position after 46. c5 and of a position in a sideline when white promotes a pawn.

position fen 2bk4/5q1r/rPPP2p1/2PPP3/4PP1P/1PP2PP1/P1P2PPP/1P4PP b - - 0 46
Eval term | White | Black | Total
          | MG EG | MG EG | MG EG
Material | --- --- | --- --- | -6.46 -3.86
Imbalance | --- --- | --- --- | 3.06 3.06
Pawns | --- --- | --- --- | 4.62 -1.02
Knights | 0.00 0.00 | 0.00 0.00 | 0.00 0.00
Bishops | 0.00 0.00 | -0.05 -0.06 | 0.05 0.06
Rooks | 0.00 0.00 | 0.08 0.06 | -0.08 -0.06
Queens | 0.00 0.00 | -0.01 0.01 | 0.01 -0.01
Mobility | 0.00 0.00 | 0.20 1.07 | -0.20 -1.07
King safety | -0.15 0.00 | -1.44 -0.20 | 1.29 0.20
Threats | 0.47 0.28 | 1.55 1.33 | -1.08 -1.06
Passed pawns | 1.28 3.15 | 0.00 0.00 | 1.28 3.15
Space | 1.38 0.00 | 0.02 0.00 | 1.36 0.00
Initiative | --- --- | --- --- | 0.00 0.00
Total | --- --- | --- --- | 2.49 -0.62

Total Evaluation: 1.10 (white side)

position fen 1Q6/3k3r/r1q3p1/2P1P3/5P1P/1PP2PP1/P1P2PPP/1P4PP w - -
Eval term | White | Black | Total
          | MG EG | MG EG | MG EG
Material | --- --- | --- --- | 1.62 2.75
Imbalance | --- --- | --- --- | 2.45 2.45
Pawns | --- --- | --- --- | -0.35 -3.36
Knights | 0.00 0.00 | 0.00 0.00 | 0.00 0.00
Bishops | 0.00 0.00 | 0.00 0.00 | 0.00 0.00
Rooks | 0.00 0.00 | 0.08 0.06 | -0.08 -0.06
Queens | 0.00 0.00 | -0.00 0.00 | 0.00 -0.00
Mobility | 0.29 0.48 | 0.26 1.27 | 0.03 -0.79
King safety | -0.20 0.00 | -2.19 -0.50 | 1.99 0.50
Threats | 0.00 0.00 | 2.73 2.24 | -2.73 -2.24
Passed pawns | -0.28 0.50 | 0.00 0.00 | -0.28 0.50
Space | 0.64 0.00 | 0.02 0.00 | 0.62 0.00
Initiative | --- --- | --- --- | 0.00 0.00
Total | --- --- | --- --- | 2.65 -0.25

Total Evaluation: 1.22 (white side)

Aahh, right, the white queen is supposedly worth as much as the black one, so it's still equal. That makes sense, though like you I think it's just wrong.

Once white has a queen it's less like horde and more like a classical endgame where many marginal pawns easily submerge few marginal pieces.

It's not just that SF exchanges the white queen, it's also that black tries very hard to force the exchange (as he should), so white can only exchange it, or hide it at weak places (which imo remains better though).

@ubdip Care to explain a bit how this works ? I got a few specific questions.

What is LLR ? I seem to gather modifs are judged on that or ELO change. Log-likelihood ratio ? Probability that the change is good given the result of games vs unmodified code ?

I see `* 200`: what is the base unit of the bonus? Millipawns? You made it "each pawn is worth 1.2 instead of 1 if a queen is present, 1.4 if there are two queens, etc.", is that right?

And in the eval, what is "MG EG" ? "Pawns" I suppose is a horde-specific material-balancing factor ? What is "Imbalance" ?

Can you try just making a horde-sided queen worth "12"? (i.e. its current value and 2 pawns, i.e. a bonus of `pieceCount[Us][QUEEN] * 2000` if I understand correctly).

I see that by promoting, the horde side gained material (I think, I mean in the top rows), and lost on the Threats (a lot), Passed pawns, and Space factors. Is this because black can now threaten the horde queen, or because it got easier to get to the back rank ?

I have the feeling that white had so many bonuses for being close to promotion that now that he has a real queen, it's hardly worth more. That runs contrary to my intuition that queening in horde is super hard, but once you actually have a queen it's game over. Maybe that intuition is not objectively true and it's just that time-pressed humans give up at that point, I dunno.

Maybe try removing black's bonus for getting to the back rank if white has a queen ?

LLR is the log-likelihood ratio of an SPRT (sequential probability ratio test). Compared to playing a fixed number of games and then judging based on Elo, SPRT uses resources more effectively and stops a test as soon as there is a statistically significant result.
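For illustration, a minimal sketch of how such a test can be computed from win/draw/loss counts. It uses a simple normal approximation of the per-game score, which is an assumption here and not necessarily the exact model the testing framework uses; the function names and the example Elo bounds are hypothetical.

```python
import math


def expected_score(elo_diff):
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate LLR of H1 (elo1) vs H0 (elo0), normal approximation."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    if var == 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's stopping bounds for the LLR."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def sprt_decision(wins, draws, losses, elo0, elo1, alpha=0.05, beta=0.05):
    """Return 'accept H0', 'accept H1', or 'continue'."""
    lower, upper = sprt_bounds(alpha, beta)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr <= lower:
        return "accept H0"
    if llr >= upper:
        return "accept H1"
    return "continue"


print(sprt_decision(600, 300, 100, elo0=0, elo1=5))
```

The point is that the LLR is updated after every game and the test stops as soon as it crosses either bound, instead of always playing a fixed number of games.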

The evaluation scale is kind of arbitrary. Currently a pawn is around 320 in horde, but in the specific code where I added the term, the numbers are divided by 16 beforehand, i.e. 200 in this case is about 0.04 pawns. E.g., 200 * 1 (queen) * 20 (pawns) / 16 / 320 (pawn value) ≈ 0.8, i.e. a bonus of a bit less than a pawn.
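The same arithmetic as a small sketch, using the numbers quoted above (pawn value of about 320, division by 16). The function and constant names are made up for illustration and do not come from the Stockfish source.

```python
PAWN_VALUE = 320       # approximate horde pawn value quoted above
INTERNAL_DIVISOR = 16  # the term is divided by 16 before being applied


def queen_bonus_in_pawns(weight, queens, pawns):
    """Convert weight * queens * pawns into pawn units."""
    return weight * queens * pawns / INTERNAL_DIVISOR / PAWN_VALUE


# One unit of the weight, per queen and per pawn:
print(queen_bonus_in_pawns(200, 1, 1))   # 0.0390625, about 0.04 pawns
# One queen supporting 20 pawns:
print(queen_bonus_in_pawns(200, 1, 20))  # 0.78125, a bit less than a pawn
```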

MG and EG mean middlegame and endgame. Every position is evaluated twice, once as if it were a middlegame and once as if it were an endgame, and in the end the two numbers are combined depending on the material count, i.e. with a higher weight on the endgame value if there is little material left.
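A minimal sketch of that kind of interpolation, assuming a phase weight between 0 (pure endgame) and 1 (pure middlegame) derived from the remaining non-pawn material. The material limits below are hypothetical placeholders, not Stockfish's actual constants.

```python
def game_phase(non_pawn_material, eg_limit=4000, mg_limit=15000):
    """Map remaining material to a weight in [0, 1] (limits are made up)."""
    clamped = max(eg_limit, min(non_pawn_material, mg_limit))
    return (clamped - eg_limit) / (mg_limit - eg_limit)


def tapered_eval(mg, eg, phase):
    """Blend middlegame and endgame scores; phase=1 means pure middlegame."""
    return mg * phase + eg * (1.0 - phase)


# MG/EG totals from the first table above, blended at an arbitrary phase:
print(tapered_eval(2.49, -0.62, 0.5))
```

The "Total Evaluation" line in the engine output may include further adjustments on top of this blend, so this sketch is not expected to reproduce it exactly.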

I can try to increase the queen value for the horde side. This value has been tuned automatically, but the tuning does not always find the optimal values.

In the position with the queen, white gave up several pawns on the fifth and sixth rank to be able to promote, so the values of white's passed pawns ("Passed pawns"), its pawn structure ("Pawns"), and the space advantage ("Space") decreased. "Imbalance" is part of the material evaluation and evaluates the asymmetry between white's and black's pieces. This is also the part where I tried to add the bonus for white's queen.

"Maybe try removing black's bonus for getting to the back rank if white has a queen ?"
That is a very good idea, I'll try that. That really describes why having the queen helps so much. It very much stabilizes the horde to have a queen that can support almost any pawn on the board by just making one or two moves.

A white queen also usually sets up a lot of threats, e.g. by attacking trapped rooks or by using checks and forks to gain material on a board where black usually does not have any pawn shelter left. I will think about how this can be taken into account in the king safety or threat evaluation.

Edit: Tests are running now.