@Lumbra74 said in #1:
https://www.yottachess.com/ has 14.680.341 games from 1.361.713 players.
It has a lot of features and you can download it for free. It is more complete than the Chessbase database.
Are you aware of the yottabase? If so, what is your differentiator?
SCID can remove the duplicates pretty easily under Maintenance > delete twin games, so it's not a big deal. I do something similar with a little Python script to download many weekly files from TWIC at once.
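For anyone who wants to try the batch download, here is a minimal sketch. It is not the script mentioned above. It assumes the weekly PGN archives follow the URL pattern `https://theweekinchess.com/zips/twic<issue>g.zip`; that pattern, the function names and the user-agent string are assumptions, so check them against the TWIC site first.

```python
import time
import urllib.request
from pathlib import Path

# Assumed URL pattern for the weekly PGN archives; verify it on the TWIC site.
TWIC_URL = "https://theweekinchess.com/zips/twic{issue}g.zip"


def twic_urls(first_issue, last_issue):
    """Return the archive URLs for issues first_issue..last_issue (inclusive)."""
    return [TWIC_URL.format(issue=n) for n in range(first_issue, last_issue + 1)]


def download_all(urls, target_dir="twic", pause_seconds=1.0):
    """Download each archive into target_dir, skipping files already present."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in urls:
        dest = out / url.rsplit("/", 1)[-1]
        if dest.exists():
            continue
        # Some servers reject the default urllib user agent, so send our own.
        req = urllib.request.Request(url, headers={"User-Agent": "twic-fetch/0.1"})
        with urllib.request.urlopen(req) as resp:
            dest.write_bytes(resp.read())
        time.sleep(pause_seconds)  # be polite to the server
```

Calling `download_all(twic_urls(1520, 1526))` would fetch seven weekly archives into a local `twic` folder, which can then be unzipped and imported into SCID.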
@mvhk said in #11:
Thank you for the hint, but I checked the website and didn't find any option to download the whole database, except on this page: https://www.yottachess.com/yottabase
But first, it doesn't seem to be free of charge, and second, getting the database seems to be a kind of lottery.
As an online tool - great!! But offline, I don't really know.
@deepvalue124 said in #12:
> SCID can remove the duplicates pretty easily under the Maintenance > delete twin games, so it's not a big deal. I do something similar with a little python script to download many weekly files from twic at once
and
@FBF said in #10:
> There are many duplicates in the players' names too.
> For example, you have 3 duplicate games played on 2002.12.08 between Diulger, Alexey and Chirila, C. - Chirila, C.. - Chirila, Ioan-Cristian
> The best approach would be to spell-check the player names and events before deleting the games.
> If you search duplicate games with the options "same moves", "same year, month, day" and "First 4 letters only" more than 2 MILLION duplicate games are found.
I already did that. Right now I'm trying to figure out the best combination of these parameters.
So far, I've selected the following ones:
- Player names: exact match
- Same site (I will alternate that with "Same event")
- Same year
- Same moves
I'm currently checking the database for more issues like double ".." or leading/trailing spaces in the PGN tags.
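A rough sketch of that kind of tag cleanup, in case it helps anyone doing the same. It only touches PGN tag-pair lines, trims and collapses whitespace, and reduces runs of dots to a single dot. The function names are made up for illustration, and note that it would also shorten a deliberate "..." inside an event name.

```python
import re

# A PGN tag pair on its own line, e.g. [Black "Chirila, C."]
TAG_LINE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def clean_tag_value(value):
    """Trim the value, then collapse whitespace runs and runs of dots."""
    value = re.sub(r"\s+", " ", value.strip())
    return re.sub(r"\.{2,}", ".", value)


def clean_pgn_line(line):
    """Clean a tag-pair line; return any other line (moves, blanks) unchanged."""
    match = TAG_LINE.match(line)
    if not match:
        return line
    name, value = match.groups()
    return f'[{name} "{clean_tag_value(value)}"]'
```

Applied line by line (for example over `text.splitlines()`), this turns `[Black " Chirila, C.. "]` into `[Black "Chirila, C."]` and leaves move text such as `1... e5` alone.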
Spelling correction beyond those fixes is really hard. And if I match only the first 4 letters, I don't trust the search function enough. It might be correct, but I have no chance of checking the result if there are more than a few thousand hits.
I'm not sure how to proceed once I have cleared the most obvious issues described above.
Are you guys willing to have twin games in the database, or do you prefer no twin games at the cost of sacrificing an unknown number of unique games?
Regards
Michael/Lumbra74
Hello out there,
Thank you for all your advice. I have reworked the database and reduced the duplicate games much further.
I tried to eliminate the duplicates while still keeping either "Same site", "Same event" or "Same round" checked. But even after this, around half a million duplicates were found. I checked the first 100 games in the SCID dialog and found that the moves do indeed seem to be identical, but the location, the event name and even the rounds were vastly different.
So I got fed up and deleted the rest as well. :D
The current stats are now:
9.386.860 Games
467.825 Players
94.887 Events
26.367 Locations
https://LumbrasGigaBase.com
Regards,
Michael/Lumbra74
@FBF said in #10:
> If you search duplicate games with the options "same moves", "same year, month, day" and "First 4 letters only" more than 2 MILLION duplicate games are found.
I was now able to reduce the duplicate games much further by following the path of your hint. Thanks for that.
The parameters I used to clean up the duplicates this time were:
- first 4 letters of the player names
- same colors
- same year
- same result
- same moves
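To make those five criteria concrete, here is a small sketch of the same idea outside SCID. It is not SCID's algorithm, just one possible reading of the options: each game is a dict with `White`, `Black`, `Date`, `Result` and `Moves` entries (a hypothetical layout), and lower-casing the names is an extra assumption on top of the list above.

```python
from collections import defaultdict


def twin_key(game):
    """Build a comparison key from the five criteria listed above."""
    return (
        game["White"].strip().lower()[:4],  # first 4 letters; White and Black are
        game["Black"].strip().lower()[:4],  # kept apart, which gives "same colors"
        game["Date"][:4],                   # year part of a date such as 2002.12.08
        game["Result"],
        " ".join(game["Moves"].split()),    # same moves, ignoring whitespace
    )


def find_twins(games):
    """Group games by key and return only the groups with more than one game."""
    groups = defaultdict(list)
    for game in games:
        groups[twin_key(game)].append(game)
    return [group for group in groups.values() if len(group) > 1]
```

With this key, "Chirila, C." and "Chirila, Ioan-Cristian" fall into the same group, but so would any two different players whose names start with the same four letters, which is exactly the risk of losing unique games.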
I added a few new sources to the database, but some of these games were quite rubbish. Thankfully, I had edited the PGNs beforehand, so I can find them via the added source tag and remove them again. In these games, for instance, both players were written into the White or the Black player tag, which makes them useless even for duplicate filtering.
Current content of the database:
- 8.675.290 Games
- 466.802 Players
- 85.834 Events
- 26.668 Locations
A "Source" tag has been added to all games:
"TWIC" – for all games from the TWIC download
"Lumbras Giga Base" – for all games whose source is not clearly visible, as they are from pre-compiled databases.
"Britbase" – The British Chess Game Archive
"ChessOK" – chessok.com still offers PGNs free of charge
"Chess Nostalgia" – nothing can be found about this website anymore
"PGN Mentor" – for all games downloaded from PGNMentor
PGN Mentor and TWIC are the sources I trust most to be quite clean, so I imported them last and removed the duplicate games with the lower database ID.
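For anyone who wants to tag their own PGN files the same way before importing, a minimal sketch follows. It is only an illustration, not the procedure used for this database: it appends a `[Source "..."]` line to every game header that does not already have one. The function name is hypothetical, and the source name is written as-is, so it must not contain double quotes.

```python
import re

# A PGN tag pair on its own line, e.g. [Event "..."]
TAG_LINE = re.compile(r'^\[\w+\s+".*"\]\s*$')


def add_source_tag(pgn_text, source):
    """Append [Source "..."] to every game header that has no Source tag yet."""
    out = []
    in_header = False
    has_source = False
    for line in pgn_text.splitlines():
        if TAG_LINE.match(line):
            in_header = True
            if line.startswith("[Source "):
                has_source = True
        elif in_header:
            # First non-tag line after a header: the header ends here.
            if not has_source:
                out.append(f'[Source "{source}"]')
            in_header = False
            has_source = False
        out.append(line)
    if in_header and not has_source:
        # The text ended inside a header (a game without moves).
        out.append(f'[Source "{source}"]')
    return "\n".join(out) + "\n"
```

Running each downloaded file through `add_source_tag(text, "PGN Mentor")` before import keeps existing Source tags untouched and makes it possible to filter or remove a whole source later.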
Regards,
Michael/Lumbra74
P.S.: TWIC 1526 is also added
Cool.
I have another suggestion: add lichess broadcasts as a source ( https://lichess.org/api#tag/Broadcasts )
The main advantage is that those games include the move times.
For example:
https://lichess.org/api/broadcast?nb=10
returns a JSON array with the last 10 tournaments.
Using the ID of the tournament, it is possible to download the PGN:
https://lichess.org/api/broadcast/HNi1NcRC.pgn
Here is a lichess_broadcasts.zip file that contains older tournaments:
https://sourceforge.net/projects/scid/files/Scid/Additional%20Files/
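The two calls above can be combined into a short script. This is a sketch under assumptions: the exact layout of the listing response is not confirmed here, so the parser accepts either a JSON array or one JSON object per line, and it looks for the tournament ID either at the top level or under a `tour` key. The function names are made up for illustration.

```python
import json
import urllib.request

API = "https://lichess.org/api/broadcast"


def parse_broadcast_list(body):
    """Parse the listing, accepting a JSON array or one JSON object per line."""
    body = body.strip()
    if body.startswith("["):
        return json.loads(body)
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def tournament_id(entry):
    """Return the tournament ID; the "tour" wrapper is an assumed layout."""
    return entry.get("tour", entry)["id"]


def pgn_url(tour_id):
    """Build the PGN export URL for one tournament."""
    return f"{API}/{tour_id}.pgn"


def fetch(url):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def download_latest(count=10):
    """Yield (tournament ID, PGN text) for the most recent broadcasts."""
    for entry in parse_broadcast_list(fetch(f"{API}?nb={count}")):
        tour_id = tournament_id(entry)
        yield tour_id, fetch(pgn_url(tour_id))
```

Looping over `download_latest(10)` and writing each PGN text to `<tournament ID>.pgn` gives one file per tournament, ready for import.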
I will take a look at it. But I think I would prefer mostly classical master games in that database. I might set up a second database if the community would like to have something like that.
@Lumbra74 said in #18:
> I will take a look at it. But I think I would prefer mostly classical master games in that database.
Oh, I was unclear.
Those are classical master games.
For example, this is the Tata Steel 2024:
https://lichess.org/api/broadcast/ycy5D2r8.pgn
Those games are already in your database, but without the move times. And those are incredibly useful: you can determine when the players left their opening preparation, whether they were in time pressure, and so on.
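As an illustration of what the move times allow, here is a small sketch that reads the `[%clk h:mm:ss]` comments from the move text of one game and works out how long each side spent per move. It assumes every move carries a clock comment, so that the readings strictly alternate between White and Black; a game with missing clocks would need more care. The function names are hypothetical.

```python
import re

# Clock comment as found in the PGN, e.g. { [%clk 1:39:52] }
CLK = re.compile(r"\[%clk (\d+):(\d{2}):(\d{2})(?:\.\d+)?\]")


def clocks_in_seconds(movetext):
    """Return every [%clk h:mm:ss] value in the move text, in seconds."""
    return [int(h) * 3600 + int(m) * 60 + int(s) for h, m, s in CLK.findall(movetext)]


def time_spent(movetext):
    """Return (white, black) lists of seconds used between consecutive moves.

    Clock readings are assumed to alternate White, Black, White, ...
    A negative value means the clock went up, for example through an
    increment or a time bonus at a time control.
    """
    clocks = clocks_in_seconds(movetext)
    white, black = clocks[0::2], clocks[1::2]

    def deltas(values):
        return [before - after for before, after in zip(values, values[1:])]

    return deltas(white), deltas(black)
```

A long run of very small values at the start of a game suggests the player was still in preparation, and the first large value marks roughly where the thinking started.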
@FBF said in #19:
> Oh, I was unclear.
> Those are classical master games.
> For example this is the Tata Steel 2024:
> lichess.org/api/broadcast/ycy5D2r8.pgn
>
> Those games are already in your database but without the move times. And those are incredibly useful: you can determine when the players went out of the opening preparation, whether the players were in time pressure, and so on.
Ah ok, I understand. Yes, right now they come into the database through TWIC. That's interesting. But my version of Scid doesn't display the move times, except in the PGN itself.
This JSON is very ugly. Do you know of a script that downloads the broadcast PGNs? I guess there should be something for that out in the world...
Edit: Found a script that parses the JSON...