Free chess game database with over 11 million games (Scid vs. PC database format)

@Lumbra74 said in #20:
> This JSON is very ugly. Do you know of a script that downloads the broadcast PGNs?

I asked ChatGPT to write a Python script (it seems that the Lichess forum does not support code formatting):

```python
import json

import requests


def find_json_objects(text):
    """Collect every top-level JSON object found in text by matching braces."""
    json_objects = []
    stack = 0  # current brace nesting depth
    start = None
    for i, char in enumerate(text):
        if char == '{':
            if stack == 0:
                start = i  # a new top-level object begins here
            stack += 1
        elif char == '}' and stack > 0:  # ignore stray closing braces
            stack -= 1
            if stack == 0 and start is not None:
                try:
                    json_objects.append(json.loads(text[start:i + 1]))
                except json.JSONDecodeError:
                    print("Error parsing JSON object.")
                start = None
    return json_objects


def download_pgn(tournament_id):
    """Download the PGN of one broadcast tournament to <tournament_id>.pgn."""
    pgn_url = f"https://lichess.org/api/broadcast/{tournament_id}.pgn"
    response = requests.get(pgn_url, timeout=60)
    if response.status_code == 200:
        save_path = f"{tournament_id}.pgn"
        with open(save_path, "wb") as file:
            file.write(response.content)
        print(f"PGN file for {tournament_id} downloaded successfully and saved as {save_path}.")
    else:
        print(f"Failed to download PGN for {tournament_id}. Status code: {response.status_code}")


def main():
    url = "https://lichess.org/api/broadcast?nb=100"
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        json_objects = find_json_objects(response.text)
        print(f"Found {len(json_objects)} tournaments.")
        for obj in json_objects:
            tour = obj.get('tour', {})
            name = tour.get('name')
            tour_id = tour.get('id')
            print(f"Name: {name}, ID: {tour_id}")
            if tour_id:  # skip entries that carry no tournament ID
                download_pgn(tour_id)
    else:
        print(f"Failed to download the initial JSON. Status code: {response.status_code}")


if __name__ == "__main__":
    main()
```
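A possible simplification of the brace-matching helper in the script above: counting braces also counts any `{` or `}` that appears inside a string value, so an event name containing a brace could be split incorrectly. If the broadcast listing arrives as newline-delimited JSON (one object per line), which is an assumption about the response format and not something confirmed in this thread, each line can be handed to `json.loads` directly. The function name `parse_ndjson` and the sample data are made up for illustration.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON: one JSON document per non-empty line."""
    objects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            objects.append(json.loads(line))
        except json.JSONDecodeError as exc:
            print(f"Skipping line {line_number}: {exc}")
    return objects


# Hypothetical sample: two broadcast entries, one with braces in its name.
sample = (
    '{"tour": {"id": "abc", "name": "A {test} event"}}\n'
    "\n"
    '{"tour": {"id": "def", "name": "B"}}\n'
)
tournaments = parse_ndjson(sample)
```

In the script, `find_json_objects(response.text)` could then be swapped for `parse_ndjson(response.text)`, provided the assumption about the format holds.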

Lumbra74

#22

The script doesn't work for me, but I wrote a bash script that does the same thing.

Now I just need to code a routine that skips files that have already been downloaded...

Thx ;)
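A minimal sketch of such a skip routine, written in Python to match the script earlier in the thread (the bash script itself was not posted). It assumes files are saved as `<tournament_id>.pgn`, as in that script; the function name `needs_download` is hypothetical.

```python
from pathlib import Path


def needs_download(tournament_id, directory="."):
    """Return True unless a non-empty PGN file for this ID already exists."""
    path = Path(directory) / f"{tournament_id}.pgn"
    # An empty file is treated as a failed earlier attempt and fetched again.
    return not (path.exists() and path.stat().st_size > 0)
```

Note that a plain file-existence check would also skip a broadcast that was still running when it was first downloaded, so games added later would be missed.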


Gaiil

#23

Hi @Lumbra74, I am looking for the same thing. Would it be OK if I asked for a copy of your bash script? I hope you don't mind. Thanks a lot.

Lumbra74 edited

#24

I've just released a new version of the database, which now includes the "Lichess Elite Database" and the games pulled from the Lichess broadcast system.

It is now available in these file formats:

si4 (Scid vs. PC/Mac)
si5 (Scid 5.0)
PGN (divided into different time ranges)

Contents:

12.494.360 Games
529.315 Players
86.010 Events
26.667 Locations


Lumbra74 edited

#25

Here is another small update regarding the broadcasts:

Within one broadcast event (max. 200) there are again several sub-events, and those that have already been completed are given the tag "finished".
I am currently pulling the overview of all broadcasts and then scanning the JSON file for sub-events that are completed at the time of the query. I save the names of these events in a separate file so that they are not downloaded again during the next query cycle.
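The bookkeeping described above could look roughly like this in Python. This is a sketch under assumptions: each broadcast entry is taken to carry a `rounds` list whose items have an `id` and an optional boolean `finished` field, and IDs are stored instead of event names. All function names and the state-file layout are hypothetical.

```python
from pathlib import Path


def finished_round_ids(broadcast):
    """Return the IDs of sub-events flagged as finished (assumed JSON layout)."""
    return [
        rnd["id"]
        for rnd in broadcast.get("rounds", [])
        if rnd.get("finished") and "id" in rnd
    ]


def load_seen(path):
    """Read the IDs recorded by earlier runs; a missing file means none."""
    file = Path(path)
    if not file.exists():
        return set()
    lines = file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remember(path, ids):
    """Append newly downloaded IDs so the next run skips them."""
    with open(path, "a", encoding="utf-8") as handle:
        for item in ids:
            handle.write(f"{item}\n")


def rounds_to_download(broadcasts, seen):
    """List finished sub-events that have not been downloaded yet."""
    todo = []
    for broadcast in broadcasts:
        for round_id in finished_round_ids(broadcast):
            if round_id not in seen and round_id not in todo:
                todo.append(round_id)
    return todo
```

Only finished sub-events are recorded, so anything still in progress is picked up again on a later run.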

The script runs via cron once a day so that I always have the latest data.

The initial download takes about 3 hours, because the Lichess servers keep returning a status code that asks you to wait a minute before the next request.
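One way to handle that wait in Python is sketched below. It assumes the status code in question is HTTP 429 ("Too Many Requests") and pauses 60 seconds before retrying. The function name and the `opener` and `sleep` parameters are hypothetical; they exist only so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def fetch_with_wait(url, wait_seconds=60, max_attempts=5,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, pausing and retrying whenever the server answers 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a rate limit, or the final failure.
            if err.code != 429 or attempt == max_attempts:
                raise
            sleep(wait_seconds)
```

With many sub-events and a one-minute pause after every rate-limit response, a multi-hour initial run is plausible.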


Lumbra74 edited

#26

The structure of the database is now complete. All data has been cleaned up as far as possible.

New games will be added weekly and then uploaded - probably on Tuesdays or Wednesdays, depending on the release of the data from TWIC.

In addition to a database for Scid vs. PC/Mac (si4), a version for Scid 5.0 (si5) is now online, as well as individual PGN files for different time periods or years. I will also release a monthly "cumulative" update as a PGN file, containing all games from the previous month.

The database now contains:

More than 12.490.000 Games
More than 528.000 Players
More than 85.000 Events
More than 26.600 Locations


kajalmaya

#27

This is great work. I have a few questions.

How did you remove duplicates when the game score (moves) matched exactly but the metadata had some variations? For example, the player name may be "Karpov, A", "Karpov, Anatoly", "Anatoly Karpov", etc. If I remember correctly, there is another "Karpov, A" somewhere. Isn't this difficult to fix if there are thousands of such games?

When multiple copies are found, is there some way in Scid (or pgn-extract or another tool) to keep a version from a preferred source?

What did you do with historic games (pre-FIDE Elo era) between strong players without ratings?
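To make the duplicate question concrete, here is one possible approach in plain Python; it is not the method used for this database, which the thread does not describe. Games are grouped by a fingerprint of their moves only, and within each group the copy from the highest-priority source is kept. The record layout (`moves`, `source`, `white`) and the source names are hypothetical, and variations in parentheses are not handled.

```python
import hashlib
import re


def move_fingerprint(movetext):
    """Hash the bare move sequence, ignoring comments, NAGs and numbering."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)                     # {comments}
    text = re.sub(r"\$\d+", " ", text)                             # NAGs like $1
    text = re.sub(r"(1-0|0-1|1/2-1/2|\*)\s*$", " ", text.strip())  # result
    text = re.sub(r"\d+\.+", " ", text)                            # "1." or "1..."
    text = re.sub(r"[!?]+", "", text)                              # "!", "?!" ...
    return hashlib.sha1(" ".join(text.split()).encode("utf-8")).hexdigest()


def deduplicate(games, source_priority):
    """Keep one game per fingerprint, preferring earlier entries in the list."""
    rank = {name: i for i, name in enumerate(source_priority)}
    worst = len(rank)  # unknown sources sort last
    best = {}
    for game in games:
        key = move_fingerprint(game["moves"])
        current = best.get(key)
        if current is None or rank.get(game["source"], worst) < rank.get(current["source"], worst):
            best[key] = game
    return list(best.values())


games = [
    {"white": "Karpov, A", "moves": "1. e4 e5 2. Nf3 Nc6 1-0", "source": "other"},
    {"white": "Karpov, Anatoly", "moves": "1.e4 e5 2.Nf3 {note} Nc6 1-0", "source": "TWIC"},
]
unique = deduplicate(games, ["TWIC", "other"])
```

A moves-only key would also merge identical short games played by different people, so in practice it probably needs a second check on date or surname before a copy is dropped.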


kajalmaya

#28

Also, do you keep rapid, blitz, online games, etc.?

Lumbra74

#29

Everything, except bullet, is taken from the Lichess Elite Database, which is updated once a month.

The creator of this database wrote: "I filtered all (standard) games from lichess to only keep games by players rated 2400+ against players rated 2200+, excluding bullet games. Edit: From December 2021 on, I only kept games of 2500+ vs. 2300+ rated players."
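Read literally, the quoted rule can be written as a small predicate. This is a sketch of one interpretation, not the Elite Database creator's actual code: "2400+ against 2200+" is taken to mean the higher-rated player is at least 2400 and the lower-rated at least 2200, and the December 2021 change is applied from 2021-12-01. The function name, argument names and speed labels are assumptions.

```python
from datetime import date


def is_elite_game(white_elo, black_elo, speed, played_on, variant="standard"):
    """Apply the quoted rating and time-control filter to one game."""
    if variant != "standard" or speed == "bullet":
        return False
    # Thresholds were raised from December 2021 on.
    if played_on >= date(2021, 12, 1):
        high, low = 2500, 2300
    else:
        high, low = 2400, 2200
    return max(white_elo, black_elo) >= high and min(white_elo, black_elo) >= low
```

For example, a 2450 vs. 2250 blitz game passes in November 2021 but not in December 2021.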


kajalmaya

#30

Do you know this website: https://chessnerd.net/? The person who maintains it publishes interesting collections that are largely disjoint from yours, so they may be of interest.
