Stockfish and Komodo Benchmarks with Ryzen Threadripper

The number of nodes/sec says nothing about playing strength. Komodo has a more sophisticated evaluation function than Stockfish, which makes its search slower, but it compensates in Elo terms with superior positional understanding. What is the purpose of such benchmarks?
@wolfram_ep is correct. Also, it's incredibly hard to actually count the number of nodes checked, so the figures are just estimates anyway. The way node count works in Stockfish is that it iterates through the threads and reads how many nodes each one has traversed. While it's iterating, the counts can still be increasing, so the later a thread is in the iteration order, the more time it has had to search before being read, and the earlier, the less. It doesn't matter whether you take the time delta before or after; you won't get an accurate count.
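To make the counting problem concrete, here is a minimal Python sketch. It is not Stockfish's code. It is a deterministic toy model in which every name (`true_total`, `sampled_total`, `rates`, `read_cost`) is made up for illustration, and each thread's counter is assumed to grow at a constant rate while the counters are read one after another.

```python
def true_total(rates, t):
    """Exact sum of all counters at a single instant t (in ticks)."""
    return sum(rate * t for rate in rates)


def sampled_total(rates, start, read_cost=1):
    """Sum the counters one at a time, as a reporting loop would.

    rates:     hypothetical nodes-per-tick for each thread
    start:     tick at which the first counter is read
    read_cost: assumed ticks that pass between two consecutive reads
    """
    total = 0
    t = start
    for rate in rates:
        total += rate * t      # this thread's counter at the moment it is read
        t += read_cost         # the other threads keep searching meanwhile
    return total


rates = [100, 100, 100, 100]   # four threads, assumed equally fast
start = 1000
first_read = start
last_read = start + (len(rates) - 1)   # matches the default read_cost of 1

print("true total at first read:", true_total(rates, first_read))
print("sampled total:           ", sampled_total(rates, start))
print("true total at last read: ", true_total(rates, last_read))
```

In this toy setup the sampled figure lands between the two exact totals and matches neither, because later threads are credited with more search time than earlier ones.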
#2 Yep, that's true. I used a misleading or even wrong title; this is a CPU benchmark with chess engines.
If #3 is right and there is no accurate way to measure nodes/sec, and all benchmarks with chess engines are only estimates, I guess these estimates are pretty good; otherwise the benchmarks would be meaningless.
As was mentioned, the point of such benchmarks is to compare the CPUs, not the engines.

So long as the counting is done in a fairly consistent way and multiple measurements are taken, there's no huge problem with using it as a CPU benchmark.

The unfortunate thing with that site's report is that they don't say what value they use for threads (i.e., are they setting it to the number of physical cores or the number of logical cores?), which could greatly affect the results.

At any rate, the actually tricky thing is inferring anything from such results about how strong SF or K will be on each processor compared to the others.

If you have the same number of cores with the same setting for threads, then it's pretty simple. You can more or less count on a higher NPS meaning higher playing strength.

When the number of cores and the setting for threads is different, you can't really do that anymore.

SF on a 32-core machine getting 32 Mnps could be measurably weaker than SF on a 10-core machine getting 20 Mnps, because of inefficiencies in parallel search in chess (basically you get diminishing returns from adding extra threads, so at some point having fewer but faster cores is better than having more but slower cores).

In the olden days when parallel search basically just tried to divide up the serial search space as efficiently as possible, you could still get a fairly reasonable idea of relative strength across processors for chess by measuring how long it took an engine to get to a certain depth on a bunch of positions (time-to-depth or TTD) on each processor.

These days you can't even use that metric anymore, because of the rise in popularity of Lazy SMP, a different approach that doesn't even try to divide the serial search across threads. It is definitely used in SF, and, judging by hints here and there on talkchess from the Komodo team, something similar is used there as well.

Unfortunately, for comparisons of the strength of, e.g., Stockfish across multiple processor platforms, the only true measure is having them play a bunch of games to test for a rating difference, which is rather impractical.

Fun times we live in :)

@JDawg0

It's very true, but don't take my word for it. Do your own research to falsify/verify.

Also, do note that I didn't say 32 Mnps/10 Mnps on 32/10 cores.

Then you'd be getting the same per-core speed in each case, and barring a huge search bug in an engine, the extra cores will be better.

I used as my example 32 Mnps/20 Mnps on 32/10 cores.

Since there are rapidly diminishing returns from additional cores because of the inefficiencies inherent in all known and used parallel search approaches, having a twice-as-fast per-core speed with fewer cores could very easily be stronger.

It's obviously not possible with current technologies, but in the extreme case, SF getting 32 Mnps on a single core would be substantially stronger than SF getting 32 Mnps on 32 cores.

The exact point at which slower-but-more cores equals faster-but-fewer cores is just an empirical matter for each engine.
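As a rough way to play with the 32 Mnps/32 cores versus 20 Mnps/10 cores example, here is a small Python sketch. The power-law model in it (effective speed = per-core speed times cores to the power `alpha`) is purely an assumption made for illustration, not a measured property of Stockfish or Komodo, and the function names are hypothetical. `alpha = 1` would mean perfect scaling; smaller values mean stronger diminishing returns.

```python
import math


def effective_speed(total_mnps, cores, alpha):
    """Single-core-equivalent speed under the ASSUMED power-law model."""
    per_core = total_mnps / cores
    return per_core * cores ** alpha


def break_even_alpha(mnps_a, cores_a, mnps_b, cores_b):
    """Value of alpha at which both machines have equal effective speed.

    Solves per_a * cores_a**alpha == per_b * cores_b**alpha for alpha.
    """
    per_a = mnps_a / cores_a
    per_b = mnps_b / cores_b
    return math.log(per_b / per_a) / math.log(cores_a / cores_b)


# The example from this thread: 32 Mnps on 32 cores vs 20 Mnps on 10 cores.
for alpha in (1.0, 0.8, 0.6, 0.5):
    big = effective_speed(32, 32, alpha)
    small = effective_speed(20, 10, alpha)
    print(f"alpha={alpha}: 32-core={big:.2f}, 10-core={small:.2f}")

print("break-even alpha:", round(break_even_alpha(32, 32, 20, 10), 3))
```

Under this toy model the 10-core machine comes out ahead whenever `alpha` is below the break-even value, and the 32-core machine comes out ahead above it. Where a real engine actually sits can only be found by testing.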

Again, though, you don't need to take my word for it. Just do some research to confirm/disconfirm. In such a technical area of study, neither my words on a forum nor how difficult you find something to believe should be trusted too much on their own merit :)

Also, yes, Lazy SMP is a much simpler approach from a coding perspective, but that doesn't mean it scales linearly or doesn't suffer from diminishing returns (indeed, there are diminishing strength returns on more time, period, as a result of more and more draws as the quality of play goes up, so it's not surprising that there are still diminishing returns when the extra CPU time is gained from a less efficient process).

There is indeed almost no overhead from organizing threads in Lazy SMP, but that's also because it no longer even attempts to divide the serial search tree's work.

Instead, it just starts a bunch of threads searching to different depths, and relies on the information from the deeper search threads being stored in the hash table to help the overall search.

However, that also has inefficiencies. With a nominal search depth (these days, more appropriately just an "iteration", since actual depth is highly variable) of 20, having two threads, one searching to depth 20 and one searching to, say, depth 22, and communicating via the hash table will result in an increase in strength, but it's still going to be less than what you'd get from having twice the CPU time in a single-thread search, and the gains will still diminish as you add more and more threads.

Further, the "overhead" of even the traditional parallel algorithms that attempted to split the tree was not solely from communicating results between threads.

It was a fundamental limitation of splitting the work of an alpha-beta search. Alpha-beta search works well precisely because as the search proceeds, it can ignore branches that it knows can't beat the current best line (for either player).

When you split that among threads, now each thread doesn't have all the information about bounds it would need to ignore all the branches a serial search could ignore, so even the most sophisticated attempts to split the tree mean that an 8-thread parallel alpha-beta search to some depth will have to examine a larger tree than a serial alpha-beta search to the same depth.
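The effect of missing bounds can be seen on a toy tree. The Python sketch below is only an illustration under stated assumptions: the tree shape (`BRANCH`, `DEPTH`), the `leaf_value` scoring, and the "split" scheme (each root move searched independently with a full window, simulating workers that share no bounds at all) are all made up here, and the workers run one after another rather than as real threads. No real engine splits this naively.

```python
INF = float("inf")
BRANCH = 4   # moves per position in the toy tree (assumed)
DEPTH = 6    # search depth in plies (assumed)


def leaf_value(node):
    """Deterministic pseudo-random score in [-100, 100] for a leaf."""
    return (node * 2654435761) % 201 - 100


def alphabeta(node, depth, alpha, beta, counter):
    """Fail-hard negamax alpha-beta; counter[0] counts visited nodes."""
    counter[0] += 1
    if depth == 0:
        return leaf_value(node)
    for i in range(BRANCH):
        child = node * BRANCH + i + 1
        score = -alphabeta(child, depth - 1, -beta, -alpha, counter)
        if score >= beta:
            return beta        # cutoff: this branch cannot matter
        if score > alpha:
            alpha = score      # tighter bound for the remaining moves
    return alpha


def serial_search():
    """One search from the root; bounds carry over from move to move."""
    counter = [0]
    value = alphabeta(0, DEPTH, -INF, INF, counter)
    return value, counter[0]


def split_search():
    """Each root move is searched by its own worker with a full window,
    i.e. without the bound a serial search would already have found."""
    total_nodes = 1            # the root itself
    best = -INF
    for i in range(BRANCH):
        counter = [0]
        child = i + 1          # same child numbering as in alphabeta
        score = -alphabeta(child, DEPTH - 1, -INF, INF, counter)
        total_nodes += counter[0]
        best = max(best, score)
    return best, total_nodes


serial_value, serial_nodes = serial_search()
split_value, split_nodes = split_search()
print("serial:", serial_value, "nodes:", serial_nodes)
print("split: ", split_value, "nodes:", split_nodes)
```

Both searches arrive at the same root score, but the split version cannot prune with bounds found in sibling subtrees, so it never visits fewer nodes than the serial one.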

Now, this problem is still related to the cost of communication between threads, because if the cost of communication were zero, then all threads could communicate all information all the time and avoid this limitation.

Since the cost is not zero, though, that is not feasible, so you end up with a bigger-than-necessary search tree. The actual overhead for a parallel search like that is then a combination of the cost for the communication it DOES choose to do and the need to search a bigger search tree to get to the same depth.

This is all well-explored, but again, nothing like doing your own research. I certainly don't want to be blindly believed :)

Finally, yes, it is true that some engines have tried to adjust their reported nps based on the number of threads used to account for parallel inefficiencies, but these attempts have all been based on some generally assumed "average" speedups (at least, all the ones the details for which have been divulged and of which I am aware, obviously).

They're not necessarily all that accurate (ultimately playing strength is all that really matters, and that requires a lot of testing).
I'm happy AMD made a comeback in the CPU world.
