Using Lichess's Public Data To Find The Best Chess 960 Position

Tags: Lichess, Chess variant, Software Development
Engine evaluations aren't the full story on 960 positions

What Does "Best Starting Position" Mean?

The first instinct when comparing Chess 960 positions is to run an engine evaluation at high depth and pick whichever starting position is closest to 0.0. However, this has serious limitations. A dead-even computer evaluation can mean many things for human players: a theoretically drawn endgame, a sharp position that gives chances to both sides but is a draw with perfect play, an attack that is strong against humans but can be met with perfect computer defense, and so on. So what's a more human way to rank 960 starting positions?

The criteria I came up with:

  1. starting positions that give close to equal chances to both players
  2. starting positions that produce fewer draws (people hate draws)

One could look through all of the 960 starting positions and do evaluations, but a much simpler way is to let people play out a bunch of these positions and see where they end up. Fortunately, Lichess puts all of their variant games in an open database and we can aggregate a few months together to get a sufficient dataset.

I chose to use the last four months of data (April 2022 - July 2022) to get around 1 million games (with the nice property that each of the 960 positions should have around 1,000 games) and created a pipeline using a Jupyter notebook (see the end for a link). Each file is a list of games in PGN format, so we just have to extract the relevant data from the headers. Here's a code block to get an idea of how to do that:

def parse_game_lines(data):
    # Split a PGN file (given as a list of lines) into individual game
    # strings. Each game is a header block plus a movetext block,
    # separated by blank lines, so a game ends after the second blank line.
    games = []
    current_game_data = []
    new_line_count = 0
    for line in data:
        current_game_data.append(line)
        if line == '\n':
            new_line_count += 1
        if new_line_count == 2:
            games.append(''.join(current_game_data))
            new_line_count = 0
            current_game_data = []
    if current_game_data:
        games.append(''.join(current_game_data))
    return games

import io
import chess.pgn

def convert_pgns_to_games(pgns):
    # Use a generator as we want to extract data from these game objects
    # and don't need them all in memory at once
    return (chess.pgn.read_game(io.StringIO(pgn)) for pgn in pgns)

def extract_game_data(game):
    headers = game.headers
    return {
        'black_elo': headers['BlackElo'],
        'white_elo': headers['WhiteElo'],
        # The first FEN field (the back rank) identifies the starting position
        'position': headers['FEN'].split("/")[0],
        'time_control': headers['TimeControl'],
        'result': headers['Result'],
    }

We now have a list of games but we'd like to aggregate it by position, so here's how we get it into that aggregated form:

result_df = filtered_data.groupby(['position', 'result']).size().reset_index(name='games')
result_df['result_percentage'] = result_df['games'] / result_df.groupby('position')['games'].transform('sum')

For simplicity, we'd like to pivot this data to have the results as columns and the percentages as the values, so we can do that with pandas like so:

results_pivot = pd.pivot_table(result_df, values='result_percentage', index=['position'], columns=['result']).reset_index()
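To make the aggregation concrete, here's a self-contained toy run of the two steps above (the game rows are made up for illustration; real rows come from the extracted PGN headers). Note that a position with no games of a given result gets NaN in the pivot, which shouldn't happen with ~1,000 games per position in the real data:

```python
import pandas as pd

# Hypothetical extracted game rows, standing in for the real Lichess data
filtered_data = pd.DataFrame({
    'position': ['bbqnnrkr'] * 4 + ['bqnbnrkr'] * 4,
    'result':   ['1-0', '1-0', '0-1', '1/2-1/2',
                 '1-0', '0-1', '0-1', '0-1'],
})

# Count games per (position, result), then convert counts to percentages
result_df = filtered_data.groupby(['position', 'result']).size().reset_index(name='games')
result_df['result_percentage'] = result_df['games'] / result_df.groupby('position')['games'].transform('sum')

# One row per position, one column per result type
results_pivot = pd.pivot_table(result_df, values='result_percentage',
                               index=['position'], columns=['result']).reset_index()
print(results_pivot)
```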

Now that we have everything in a nice form, how do we combine our two criteria into a single metric? I chose to use the absolute difference of win chances and the draw percentage to create something that looks like this:

def calculate_fun_index(x):
    # Reward equal winning chances, penalize draws
    return (1 - abs(x['1-0'] - x['0-1'])) - x['1/2-1/2']

This index is somewhat arbitrary and up to preference; I liked this form because:

  1. It ranges from 0 to 1, and higher is better
  2. A 1pp difference in win rates has the same effect as a 1pp difference in draw rates

With our newly created fun index we can now rank our positions.
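Applying the index and ranking is then a couple of lines of pandas. (Since the three percentages sum to 1, the index algebraically reduces to 2 × min(white win rate, black win rate), which is a handy sanity check.) A sketch with made-up positions and rates:

```python
import pandas as pd

def calculate_fun_index(x):
    # Reward equal winning chances, penalize draws
    return (1 - abs(x['1-0'] - x['0-1'])) - x['1/2-1/2']

# Hypothetical pivoted results: one row per position
results_pivot = pd.DataFrame({
    'position': ['posA', 'posB', 'posC'],
    '1-0':      [0.50, 0.35, 0.55],
    '0-1':      [0.40, 0.35, 0.25],
    '1/2-1/2':  [0.10, 0.30, 0.20],
})

results_pivot['fun_index'] = results_pivot.apply(calculate_fun_index, axis=1)
ranked = results_pivot.sort_values('fun_index', ascending=False)
print(ranked[['position', 'fun_index']])
```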

What Do The Games Say?

Here is what our index likes the least:

And what our index says is the "best" position:

The "worst" position has a large empirical advantage for white and still a sizable draw rate, while the "best" position has equal chances for both colors and a relatively low draw rate. Perhaps someone with better chess knowledge than me can explain why this is the case.

We can also look at other result metrics. Here's the position that has the best empirical chances for black:

And here's the best empirical chances for white:

Something interesting about all of these positions is that their empirical chances often don't match the starting engine evaluations (for example, the engine evaluates the position with the best empirical chances for black as better for white).

We can also look at draw rates. Here's the position that has the best empirical chances to draw:

Hate draws? Here's the position that has the lowest empirical draw chances:

It would be better if we could see expected results for different Elo ranges and time controls. We could try binning data together to get estimates, but we'd rather have some type of function that approximates the probabilities for each of the results. This sounds like a classic classification problem.
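To sketch the framing: each game becomes one training row of covariates plus a result label. The time-control bucketing below uses rough, assumed thresholds, not Lichess's exact speed definitions, and `to_training_row` is a hypothetical helper name:

```python
def time_control_class(time_control):
    # Bucket by base time in seconds; thresholds are assumptions,
    # not Lichess's exact speed definitions
    base_seconds = int(time_control.split('+')[0])
    if base_seconds < 180:
        return 'bullet'
    if base_seconds < 600:
        return 'blitz'
    return 'rapid_or_slower'

def to_training_row(game):
    # game is a dict like the ones produced by extract_game_data
    return {
        'white_elo': int(game['white_elo']),
        'elo_diff': int(game['white_elo']) - int(game['black_elo']),
        'time_control_class': time_control_class(game['time_control']),
        'position': game['position'],
        'result': game['result'],  # the label: '1-0', '0-1', or '1/2-1/2'
    }

row = to_training_row({
    'white_elo': '1520', 'black_elo': '1480',
    'time_control': '300+3', 'position': 'bbqnnrkr', 'result': '1-0',
})
print(row)
```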

Let's Get Weird (Running A Machine Learning Model)

We have our explanatory covariates (white Elo, Elo difference, time control class, starting position) and three outcomes (white win, white lose, draw), so this simplifies to a multiclass classification problem. We have a ton of categorical features, so using a CatBoost model is an obvious choice. You can see the notebook for the exact methodology, but the important point is we can feed our model arbitrary starting data and it will output point estimates of the predicted results.

We're interested in our fun index among different time controls and different relative strengths of opponent so we construct a prediction dataframe of our various combinations and take an average of the index over each of our positions.
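The prediction dataframe can be built as a cross product of the settings we care about; the specific levels below are my choices for illustration, not necessarily the ones in the notebook:

```python
import itertools
import pandas as pd

white_elos = [1000, 1500, 2000, 2500]
elo_diffs = [-200, 0, 200]
time_controls = ['bullet', 'blitz', 'rapid']
positions = ['bbqnnrkr', 'bqnbnrkr']  # in practice, all 960 positions

# Every combination of the settings above becomes one prediction row
prediction_df = pd.DataFrame(
    list(itertools.product(white_elos, elo_diffs, time_controls, positions)),
    columns=['white_elo', 'elo_diff', 'time_control_class', 'position'],
)
print(len(prediction_df))  # 4 * 3 * 3 * 2 = 72 rows

# After the model's predicted probabilities are turned into a fun index
# per row, averaging per position is a groupby:
# prediction_df.groupby('position')['fun_index'].mean()
```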

So what did our model like? I chose to look at the low end of Elo (1000) and the high end (2500) to see if it would give different position recommendations for different skill levels.

Here's the "worst" starting positions for these two levels:

And here's the "best" starting positions for these two levels:

This model gives very different rankings for our different skill levels, but as we are averaging across time controls, relative strength, and white Elo, it might be better to look at what our model says for the median user. What seems reasonable is to look at even matches for blitz 960 games at 1500 Elo.

Getting Less Weird (Looking At Our Model at 1500 Elo)

We adjust our predictions to 1500 white Elo, no rating difference, and blitz time controls.

Here's our "worst" starting position with those adjustments:

And here's our "best" starting position with those adjustments:

Our model gives lower predicted win chances for white at 1500 Elo than the raw empirical results. What's going on here? Something that holds both theoretically and empirically is that white's win chances generally increase with player skill (you can see a distribution of this in the notebook). Our model is likely reflecting this but, as expected, the model is hard to interpret.

Thoughts About This Model And Future Ideas

As expected, taking a bunch of 960 games, naively chucking them at a machine learning model, and ranking with an arbitrary indexing function left us with more questions than answers. However, this is a good start toward thinking about these starting positions closer to how people actually play them. Creating features from the positions themselves and using simpler, more interpretable models will likely be needed to explain why empirical results can deviate so far from the starting engine evaluations. Please feel free to leave a comment if you have good explanations for the empirical results listed above. I've left some links below if you want to look into this data yourself.

Further Info

  • Python Notebook (how I generated these results): link
  • Empirical Results: link
  • Model Results Table: link
  • Lichess Variant Open Database: link