
Puzzle issues resulting in a huge drop in puzzle quality (in the 2500+ range)

We have 2.5 million puzzles, from which Lichess picks the most popular for us. Assiduous puzzle users play a few thousand puzzles, but that's still only around a thousandth of what's available. Hence the puzzle popularity variable is quite important. Today it has a quite simple formula, the percentage of upvotes:

Popularity = (upvotes - downvotes)/(upvotes + downvotes)

[ EDITED ] Although, per the source, "votes are weighted by various factors such as whether the puzzle was solved successfully or the solver's puzzle rating in comparison to the puzzle's." - source: https://database.lichess.org/#puzzles
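In code, the raw formula is just the vote balance over the total vote count. This is a minimal sketch of the formula as written above; it deliberately ignores the per-vote weighting that the database page describes:

```python
def popularity(upvotes: int, downvotes: int) -> float:
    """Raw popularity in [-1, 1]: the percentage-of-upvotes formula above.

    Note: the real Lichess score additionally weights each vote (e.g. by
    whether the solver succeeded); this sketch uses raw counts only.
    """
    total = upvotes + downvotes
    if total == 0:
        return 0.0  # assumption: an unvoted puzzle starts neutral
    return (upvotes - downvotes) / total
```

For example, 90 upvotes against 10 downvotes gives a score of 0.8.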

But there are a few issues with this kind of ranking. These issues particularly affect the harder range: puzzles rated above 2500.

Issue 1) The puzzle trainer UI shows the thumbs up/down options too early. If you get the very first move wrong, you can then click the "View Solution" button. That brings up the "Puzzle Complete!" message with the big thumbs up/down (the voting options). The issue is that the player has not yet seen the remaining moves of the puzzle. The moves are added to the move list, but that's not what catches the user's attention. So there is a tendency to capture votes that are linked only to the very first move of the puzzle. Example:
https://i.imgur.com/myY7r2R.png

Expected behavior:

  • The voting options should not be shown until the user reaches the puzzle's end position.
  • All existing votes that came from non-finished puzzles should be removed from the database (if it is possible to identify them), so that popularity can be recalculated.

Issue 2) The popularity formula may be resulting in a distribution that is too skewed toward the high end:
https://i.imgur.com/bF2WVIm.jpg
Note: the right-side charts use a log scale.

As soon as a new puzzle gets its first upvote, I think it jumps to #1 in the ranking. That may explain why I have found a significant deterioration in puzzle quality after the introduction of Stockfish 15 (which is another issue, related to the puzzle generator).
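To illustrate the skew with raw (unweighted) votes: a brand-new puzzle with a single upvote scores a perfect 1.0 and immediately outranks a well-tested puzzle with a thousand votes. A hypothetical sketch:

```python
def popularity(up: int, down: int) -> float:
    # raw percentage-of-upvotes formula from the original post
    return (up - down) / (up + down) if up + down else 0.0

# hypothetical puzzles: (upvotes, downvotes)
puzzles = {
    "veteran": (1000, 20),  # score ~0.96 after 1020 votes
    "brand_new": (1, 0),    # score 1.00 after a single vote
}
ranked = sorted(puzzles, key=lambda name: popularity(*puzzles[name]),
                reverse=True)
# the brand-new puzzle ranks first despite having only one vote
```

Any ranking that ignores vote volume will show this cold-start behavior; the weighting mentioned on the database page may dampen it, but the raw formula cannot.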

Issue 3) Other very important variables that certainly impact the vote may not be taken into account by the popularity formula, such as how many rating points the user gained or lost. That can certainly affect the user's mood, and thus their vote. I could analyze the relationship between those variables by plotting more charts with the Python seaborn library, but the data I need is not publicly available, I guess. [ EDITED ]

Issue 4) As a matter of fact, improvement is needed both in the popularity formula and in the puzzle-picking criteria. One suggestion I have is to create an automated puzzle quality evaluation variable, then mix that with the puzzle popularity variable when picking new puzzles for a user. I will expand on that idea later. It's about using Stockfish's evaluation progression to estimate how solid the win calculated by the engine at the puzzle's end position actually is.

Good puzzles:
https://i.imgur.com/YKDDtix.png
(from the puzzle's end position, Stockfish can see an advantage right away)
(the third one still has some complexity going on, though)

Bad puzzles:
https://i.imgur.com/BgZQ3bN.png
(from the puzzle's end position, Stockfish still needs to look 6 plies ahead before it can see an advantage)
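A hypothetical version of that quality signal: take the engine's evaluation at increasing depths from the puzzle's end position and measure how many plies of search it takes before a decisive advantage appears. Everything here is an assumption for illustration, including the 200-centipawn threshold and the invented eval sequences:

```python
def plies_until_winning(evals_by_depth, threshold_cp=200):
    """Given hypothetical engine evals (centipawns, from the winning
    side's perspective) at depths 1, 2, 3, ..., return the first depth
    at which the advantage looks decisive, or None if it never does."""
    for depth, cp in enumerate(evals_by_depth, start=1):
        if cp >= threshold_cp:
            return depth
    return None

# "good" puzzle: the advantage is visible immediately
good = plies_until_winning([350, 420, 500])
# "bad" puzzle: six plies of search before the win becomes visible
bad = plies_until_winning([30, 50, 60, 90, 110, 310])
```

A lower value would indicate a more solid win at the end position; that number could then be mixed with popularity when picking puzzles, as suggested above.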

Please leave your feedback here too.


First, I applaud the effort you have put into this post!

There are however some things that I wish you could explain more/better and perhaps more importantly, some things you have missed.

> Popularity = Upvotes / (Upvotes + Downvotes)

This is quite inaccurate. As mentioned on https://database.lichess.org/#puzzles,

> votes are weighted by various factors such as whether the puzzle was solved successfully or the solver's puzzle rating in comparison to the puzzle's.

In other words, the popularity formula is not as simple as you suggest.

> That may explain why I have found a significant deterioration of puzzles quality after the introduction of Stockfish 15
Can you elaborate on this? First, how do you measure quality? (popularity?) Secondly, why would Sf 15 have an impact?

Regarding issue 4 and the "puzzle popularity variable when picking new puzzles for a user":
popularity is already factored in when a new puzzle is selected, so I am curious what the new idea really is here.

I like the images of "good puzzle" and "bad puzzle"; I would love to see data on whether that also matches popularity, for example. I think I could say more too, but I don't want to make a wall of text :D Things like what you are doing here are fun though!

EDIT: Basically, if you look more into how puzzles work, it would be worth the time to check your assumptions against how they are coded.


One more problem might be: People who just forget to use one of the thumbs ... I am one of them.

@TBest: You get a discussion of that topic and several examples here: lichess.org/forum/lichess-feedback/bad-quality-puzzles-recently


Thanks @TBest for the correction. I have now edited the original post accordingly. The problem was that I read that spec too fast (on the Lichess database page), so it got stored in my subconscious, but not in my conscious mind. :-)


@TBest what I said about Stockfish 15 is just a conjecture, mostly because its release coincides with the moment I noticed the puzzle quality degradation. But it could be something else. I need more data to check that I'm not just painting the new engine as my scapegoat. :-)

What I know for sure is that the patterns I see in those line charts have been matching my perception of good/bad puzzles. That's why I said the engine can be used to evaluate puzzle quality, and that evaluation can be used as an additional quality indicator: a more stable and reliable one compared to the human-voted popularity.

