@Shreksify said in #10:
> @dboing Glad I could help! :) Please do update me with your experiments. It's not every day I bump into someone who tests chess software :)
Testing for exploring... I get to see aspects of chess tangentially. :) I give myself no ETA and will keep trying. My pace is erratic and I might not follow up for a long time, but I get to loop around eventually (I hope).
I have now become curious about the Banksia author's valiant effort at proposing an open database format for chess (OCGDB):
https://github.com/nguyenpham/ocgdb
I can't help but salute that as a step toward data analysis with reproducible support, even in chess.
(Tangential should be my middle name, with some ability to spiral in or out: a new tangent on the river and drift, a "things are what they are" life paradigm.)
That open-source version uses a standard table engine at the bottom, so it cannot be the fastest. I have no clue what sort of data storage would be optimal for chess; it is a very specific problem that needs to be solved. Like in health care, where graph databases outperform SQL databases due to the nature of the queries. In chess I think a whole bunch of metadata should be indexed, as well as the actual positions. Hence it is unlikely anyone is going to outperform ChessBase anytime soon.
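To make the indexing idea concrete, here is a toy Python/SQLite sketch where the metadata columns and the position hashes each get their own index. The schema, table and column names are made up for illustration; they are not OCGDB's actual layout.

```python
# A toy sketch (not OCGDB's actual schema) of indexed metadata plus
# per-position hashes, so searches can hit an index instead of a full scan.
import sqlite3

con = sqlite3.connect("toy_chess.db3")
con.executescript("""
CREATE TABLE IF NOT EXISTS games (
    id        INTEGER PRIMARY KEY,
    white     TEXT, black TEXT,
    white_elo INTEGER, black_elo INTEGER,
    result    TEXT, event TEXT, date TEXT,
    moves     TEXT            -- SAN move text, or a compact binary blob
);
CREATE TABLE IF NOT EXISTS positions (
    game_id  INTEGER REFERENCES games(id),
    ply      INTEGER,
    pos_hash INTEGER          -- e.g. a 64-bit Zobrist hash of the position
);
-- Indexes on the metadata people actually filter by, and on the positions.
CREATE INDEX IF NOT EXISTS idx_games_white ON games(white);
CREATE INDEX IF NOT EXISTS idx_games_elo   ON games(white_elo, black_elo);
CREATE INDEX IF NOT EXISTS idx_pos_hash    ON positions(pos_hash);
""")
con.commit()
con.close()
```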
@petri999 said in #12:
> That open-source version uses a standard table engine at the bottom, so it cannot be the fastest. I have no clue what sort of data storage would be optimal for chess; it is a very specific problem that needs to be solved. Like in health care, where graph databases outperform SQL databases due to the nature of the queries. In chess I think a whole bunch of metadata should be indexed, as well as the actual positions. Hence it is unlikely anyone is going to outperform ChessBase anytime soon.
Maybe the same optimization goals of performance, for speed and for size, need to be part of the discussion. I don't think the link above is about limiting the discussion to that, though. In health care systems there might be immediacy pressure, and no respawn context...
But aside from the final time-control frontier of photon time (I mean bullet), perhaps transparency and human readability should come back into the balance. Machines are still tools. I find that we, as many individuals in society, have done a lot in so few generations to keep adapting to computer interfaces and their programming designs; some back and forth on what is adapting to what. Sorry, I spilled over into non-chess life, but it might help frame the discussion.
This is not just a technical computer question; data analysis isn't either, and chess still needs humans to play human chess (meaning we are not shrinkable or multi-core-able, when shrinking does not work anymore).
So I keep giving imaginary kudos to that repo's README, which seems to have framed the debate with all the necessary context.
It is a very recent discovery of mine; thanks to @Shreksify for sharing a big database and motivating me to make this thread.
I hope it is OK to share. The following is the test case I have been using so far:
https://sites.google.com/view/kampfschach/blitzabase/blitzabase2023
Other big databases (game and book pairs) are available on the following page (my next test case, added to the previous one; I am still exploring the world of chess DBs as a newbie):
https://banksiagui.com/otherdownloads/
Edit: and yes, part of my discovery process is about the interplay between metadata and the core raw, mostly error-free game sequences. But the format above seems like a serious attempt, at least putting more structure onto things than PGN, and paving the way for other formats if the old optimization schemes (speed and size before human readability and compatibility) are still applicable.
It is a step in a direction that a binary format, proprietary on top of that, could not go at all, and there are plenty of tools to dock with, including better choices for other contexts like the health care situations you might have been considering.
This is not yet a proposal for a live lichess SQL database to be queried individually by any lichess user (no, only by me on my aging laptop so far). Yet it was tested, as reported on that page, on some 65 or 95 million games, or maybe billions (I stop counting before that: 1, 2, many, infinity... maybe also: big finite number). Somewhere north of 10^6, anyway.
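As a taste of what "queried by any user" could look like, here is a minimal sketch against an OCGDB-style SQLite file. The file name, table name and column names are assumptions for illustration, not necessarily what the ocgdb converter actually produces.

```python
# Minimal query sketch against a hypothetical OCGDB-style SQLite file.
# "blitzabase2023.ocgdb.db3", the Games table and the *Elo columns are
# assumed names, not a verified schema.
import sqlite3

con = sqlite3.connect("blitzabase2023.ocgdb.db3")
cur = con.execute("""
    SELECT COUNT(*)
    FROM Games
    WHERE WhiteElo >= ? AND BlackElo >= ?
""", (2500, 2500))
print("games with both players 2500+:", cur.fetchone()[0])
con.close()
```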
I just tested pgnscid (which is a script distributed with Scid vs. PC for converting the games in a PGN file to the Scid database format).
A database of 729,456 games (file size 538 MB): 32 seconds (nearly 22,795 games per second).
A database of 4,289,623 games (file size 1.360 GB): 3 minutes 40 seconds (nearly 19,498 games per second).
Building a polyglot bin opening book (with the program polyglot):
From 729,456 games: 1 min 55 sec.
On the 4M-game database it failed, because my database contains games with errors as well as some funny characters, which I have to clean up first. But on my laptop it would probably be too much anyway, because for the 729,456-game collection it had already allocated 1280 MB of memory by the time it completed. The program's memory footprint does not grow smoothly either: every time it needs more memory during execution, it allocates about double the previous amount.
But this is quite fast. This is on an ordinary laptop with 4 GB RAM and an Intel i5 processor. (A small sketch for reproducing this kind of throughput measurement follows the quote below.)
@dboing said in #5:
> 2) Banksia GUI is slow, but I'm not giving up yet; it actually gives a good peek at its progress along more than one variable.
>
> data
> converting from PGN to OCGDB:
> some 800,000 games so far, 40 minutes of loading, a speed of 282 games/s. Very slow. Aborting.
> just loading as PGN leads to BG converting to its own internal format at 6000 games/s, which is more reasonable.
>
> I intend to try Scid, and even Arena later. Any suggestions for other free or open-source software (best, for many reasons)?
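For anyone wanting to reproduce this kind of games-per-second measurement, here is a small sketch using the python-chess library (an assumption on my part: it is unrelated to pgnscid, and pure-Python parsing will be much slower than pgnscid's compiled code).

```python
# Rough throughput check in the same spirit as the pgnscid numbers above.
import time
import chess.pgn

def games_per_second(pgn_path, limit=50_000):
    count = 0
    start = time.perf_counter()
    with open(pgn_path, encoding="utf-8", errors="replace") as handle:
        while count < limit:
            game = chess.pgn.read_game(handle)
            if game is None:          # end of file
                break
            count += 1
    elapsed = time.perf_counter() - start
    return count, count / elapsed if elapsed else 0.0

# Example: n, rate = games_per_second("blitzabase2023.pgn")
```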
@kajalmaya said in #14:
> I just tested pgnscid (which is a script distributed with Scid vs. PC for converting the games in a PGN file to the Scid database format).
https://scidvspc.sourceforge.net/doc/Pgnscid.htm
https://chess.stackexchange.com/questions/22771/replacement-for-pgnscid
(Something about the command going away in some versions; perhaps that is why you described it as a script.)
https://github.com/CasualPyDev/pgn2scid (a wrapper; not yet tested)
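For batch conversion, here is a small wrapper sketch, assuming pgnscid is on the PATH and takes a PGN file plus an output base name as described on the Pgnscid.htm page above (check that page for the exact arguments of your version).

```python
# Hypothetical wrapper: call pgnscid on a PGN file and put the Scid files
# next to it. The invocation "pgnscid file.pgn basename" is assumed from the
# linked documentation, not verified against every Scid vs. PC release.
import subprocess
from pathlib import Path

def pgn_to_scid(pgn_file: str) -> None:
    base = Path(pgn_file).with_suffix("")      # e.g. mygames.pgn -> mygames
    subprocess.run(["pgnscid", pgn_file, str(base)], check=True)

# pgn_to_scid("mygames.pgn")  # should produce mygames.si4/.sg4/.sn4
```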
Thank you for that. I have also tried Scid vs. PC; I may have forgotten or postponed reporting on it. My machine is slow: loading the same file as before (a 7 GB PGN) happens at roughly the same speed as Banksia just loading the PGN (not converting, which takes hours; but I am willing to work with a format that has lots of non-chess relational DB tools, and now chess ones too, from a mature ecosystem; see the repo link for the pros and the weighed cons, so the reader can adapt to their own context with all the cards in hand).
So it loads fine, and in retrospect probably better than CB Reader 2017, in the sense that Scid vs. PC gives out an error log when loading a huge read-only PGN, while the same loading in CBR gave no clue about such errors or typos (or encoding; still not sure, and only on a small portion of the test case). Both programs report the same number of games parsed.
My laptop probably has similar specs, but the 7 GB file did not take just a few minutes to load... it had some ETA maybe. I could time it later, to be fair to Scid vs. PC.
Do you know of any relational-database feature prospects for Scid's own DB format (and its original purpose)? I guess it is a public format, and more computer-efficient than PGN. It might also be what sustains the query features present in Scid.
I like that the proposed OCGDB format sits on top of a mature ecosystem of database manipulation tools (emphasis on mature and ecosystem: less wheel-reinventing for things that are not specific to chess data). That way, all the conventions of chess can be added and preserved without interference from the underlying, non-chess-specific relational database constraints (which should not impose any, or they would lose their essential purpose of linking many sorts of data types together; that is my naive understanding of relational databases). Why standardize what does not need to be, or why implement it at the lowest levels? Again, naive perhaps.
I am actually confident now that Scid is made to handle databases of sizes I have not yet tried, so it may be another kind of baseline, or fallback, given its long tradition of use. I do not think of it as obsolete; before I found out about OCGDB/SQL I thought it was the only alternative to ChessBase's own proprietary database format. ChessBase gives the luxury of a modern GUI and speed for database manipulation, as long as one uses PGN or its own non-portable format (not open, and not tapping into other accessible tools).
Your script does alleviate my inability to figure out how to actually save to its internal format, so I don't have to wait for it to load each time... I am testing and discovering many things at the same time, and probably missing obvious things too.
Thanks for your data points, too. Happy new year.
@kajalmaya said in #14:
About funny characters: UTF-8 in player names, is that something I should worry about? The errors I had in the Scid log are about FEN parts, but the error that gets caught might not reveal its cause; it might just surface in the consequences during parsing (I imagine). But I wonder, in general, among all (offline) chess software, have they all become robust to UTF strings? I don't see why funny characters in inert content, with no parsing-syntax meaning, would be a problem, but one never knows with magic computer things... :)
https://scidvspc.sourceforge.net/doc/Encoding.htm
I am really amazed by the speeds you mention; reproducing them is definitely on my TODO list.
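One quick way to hunt for such funny characters is to scan the PGN as raw bytes and flag lines that are not clean UTF-8 or that contain stray control bytes. A sketch (it only reports lines for inspection, it does not fix anything):

```python
# Flag lines in a PGN file that fail UTF-8 decoding or contain control bytes
# other than tab / newline / carriage return.
def report_suspect_lines(pgn_path, max_reports=20):
    reported = 0
    with open(pgn_path, "rb") as handle:
        for lineno, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
                bad_utf8 = False
            except UnicodeDecodeError:
                bad_utf8 = True
            has_control = any(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in raw)
            if bad_utf8 or has_control:
                print(f"line {lineno}: utf8_error={bad_utf8} control_bytes={has_control}")
                reported += 1
                if reported >= max_reports:
                    break

# report_suspect_lines("blitzabase2023.pgn")
```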
https://scidvspc.sourceforge.net/doc/Formats.htm
> The move encoding format is very compact: most moves take only a single byte (8 bits)! This is done by storing the piece to move in 4 bits (2^4 = 16 pieces) and the move direction in another 4 bits.
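A toy illustration of that one-byte idea (not Scid's actual encoding; the nibble layout here is assumed purely for illustration):

```python
# Pack a move into one byte: the mover's piece index within its side (0-15)
# in the high nibble, an index into that piece's possible moves in the low nibble.
def pack_move(piece_index: int, move_index: int) -> int:
    assert 0 <= piece_index < 16 and 0 <= move_index < 16
    return (piece_index << 4) | move_index      # fits in a single byte

def unpack_move(byte: int) -> tuple[int, int]:
    return byte >> 4, byte & 0x0F

b = pack_move(piece_index=3, move_index=7)
print(b, unpack_move(b))   # 55 (3, 7)
```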
Otherwise, the format seems to tabulate the unstructured PGN format (at the many-games level) using 3 files (CB also likes to use lots of sibling files; some herding is required there if you don't have a dedicated CB zone on your drive). That is a minor point; just saying, don't lose any of them.
It also appears to go in the direction of putting some structure onto PGN databases and, while at it, using some binary encoding to save space (and probably time as a consequence). That is valuable, if one intends to restrict oneself to always using Scid: not proprietary, but also not connected to the existing tools specialized in databases of many kinds. CB does not have such connections either, by the way; not that we could know. So maybe, as I try to be clear about my motivation and potential goals, it really depends on the nature of the projected usage.
I can see that the use case of personal chess data management, or a well-controlled-size collection of very good games from renowned sources to study and inspect as one's playing trajectory or study plan requires, might not have the same concern about the best database format. Both CB and Scid seem to have been used mainly in that fashion, if I am hearing or reading correctly. So the very question of limits, as I have been testing them, might not even be such an important question in that context.
However, one can't have it all in one magic package... and this thread is about discerning the limits of using GUI chess software as opposed to programmable libraries (or APIs), through limited graphically-aided interfaces that are usually about parsing very long strings... :)
@dboing
Sorry, I called it a script, but I checked and it is a compiled file. Also, it is still available with the distribution, but maybe they don't develop it further because Scid can import games directly. Just to clarify a little: with pgnscid you convert the PGN database to the Scid files only once, and then from Scid you open them almost instantly (i.e., you don't even have to wait the 30 seconds I mentioned for converting to Scid format). So next time, you don't open the PGN files; you open the Scid files.
I don't think the Scid database format would be inferior to ChessBase's, and I doubt ChessBase has any special knowledge about databases or efficient programs or algorithms, but ChessBase could be more feature-rich. I don't know how the databases are implemented, though. One observation I made yesterday: if you take a PGN file and simply zip or gzip it, it compresses a lot compared to many other files. I think this is because of the repetition of text sequences. The idea is somewhat like this: if you have two sequences abc and abd, you don't need 6 characters to store them together, because you can use the fact that they share the part ab. This is very much the case with chess games, with a lot of repeated strings (even beyond the opening phase). But while zip or gzip may be excellent for compression, they may not be good for searching. So for chess you have to have a different compression scheme that is also convenient for searching.
It is likely that Scid uses 3 files because, for example, you can handle metadata separately so that a lot of searches can be done fast by looking only at the metadata, and once you have found the games of interest, you can fetch them from another file. There is also a hashing scheme by which you can represent a position (or an entire game) by a short fixed-size value, which helps in searches. It is called Zobrist hashing: https://en.wikipedia.org/wiki/Zobrist_hashing. I don't know if Scid uses it, but it is very common in the computer chess world.
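A minimal Zobrist-hashing sketch, to make the idea concrete: one random 64-bit key per (piece, square), XORed together over the pieces on the board. Side to move, castling and en-passant keys are omitted for brevity, and the keys here are made up with a fixed seed, whereas real tools (e.g. chess.polyglot.zobrist_hash in python-chess) use a standard key set so hashes are comparable across programs.

```python
# Toy Zobrist hash: XOR one 64-bit key per (piece, square) actually occupied.
import random

random.seed(0)
PIECES = "PNBRQKpnbrqk"                       # 12 piece types
KEYS = {(p, sq): random.getrandbits(64) for p in PIECES for sq in range(64)}

def zobrist(board):                           # board: dict {square: piece letter}
    h = 0
    for sq, piece in board.items():
        h ^= KEYS[(piece, sq)]
    return h

# Two positions differing by one moved piece differ by exactly two XORed keys,
# which is why the hash can be updated incrementally as moves are made.
toy_position = {0: "R", 1: "N", 60: "k"}
print(hex(zobrist(toy_position)))
```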
Regarding sources of games: http://www.nk-qy.info/40h/ is a high-quality database of more than 700k games. It has only players rated 2400+, and short games under 5 moves and games with early blunders (before 15 or 20 moves) are removed. Another is Caissabase, http://caissabase.co.uk/, with 5M games (but not as well curated as 40H), and https://ajedrezdata.com/ (probably a little more than Caissabase, but again not as well curated as 40H, and it contains errors). When I say errors, I mean this: a tool known as pgn-extract is fantastic, extremely fast, and can do a lot of things with large PGN files (just take a look at the help page https://www.cs.kent.ac.uk/people/staff/djb/pgn-extract/help.html to get an idea of what it can do). I believe pgn-extract is very strict in following the PGN format specification, so some errors that slip into PGN files will be caught by pgn-extract. So if I download anything, I first run it through pgn-extract to check for errors, remove unwanted tags, and format it well.
I thought about character encoding. I also use a tool called dos2unix (I work only on Linux). But in some PGN file, polyglot complained of an illegal move, and somewhere it also detected an illegal invisible character. I have not looked into this much because I personally do not need very large databases like 5M or 10M games. I ain't a pro. But I am interested in the computational/software aspects, so I play with these tools.
You can ask questions about Banksia GUI on talkchess.com, where its main developer frequently posts, discusses developments, and answers questions.
@kajalmaya said in #18:
> This is very much the case with chess games, with a lot of repeated strings (even beyond the opening phase)
Yes, nice that you explain compression here. It made me think that, knowing that, you would expect people to want to skip the preambles and start right there... but I get it: so many patzers are weeded out that way, a big Elo gain. I could not help but find and extract this juicy bit of information from your text. Thank you, smiling. Also, I would suggest the blog post with "entropy", or was it "information", in the title.
Back to... something:
Thanks for your thoughts. Yes, I also did the experiment again, with the latest Scid vs. PC. I will try to wrap up the timing issues later.
I did look at the relative sizes. I had tried to convert using CBR 2017, but on my machine it stalled on the 7 GB test case, and I would not be able to check anyway, not having CB itself. I also never thought Scid would be inferior in what it leaves as persistent output.
Ordering by file size per format, with PGN at the end, and omitting CB for the reason above:
7z, Scid format, Banksia, PGN — roughly 2 : 3 : 4 : 7 so far. Scid is actually very compact, nearing 7z (on Windows).
I did not use the latest Banksia, but version 55. However, I should redo the experiments with the standalone command-line converters later; ocgdb also has such an executable (see the repo link in a post above).
I would expect the SQL-friendly format to take longer to digest from PGN, since it is about docking with existing database tools beyond chess, and still keeping human readability to some extent (when needed, per the repo author's proposition).
It also takes about double the size of Scid's. I have not yet reaped the benefits of database manipulation, but both programs can reload their own format almost instantly.
Not so with PGN, which has no database-wide structure to guide the software beyond a sequential scan through the whole file. The structured formats likely have all the necessary internal addressing to handle the database as a whole rather than as a stream of games (my understanding), so they don't need to scan to the bottom each time.
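A sketch of that "internal addressing" idea: build a one-time index of the byte offsets where each game starts in the PGN, then seek straight to game N instead of rescanning the whole file. This is a naive illustration, not how Scid or OCGDB actually do it.

```python
# Index the byte offset of each game's first [Event ...] tag, then use seek()
# for direct access to any game without rereading everything before it.
def build_offset_index(pgn_path):
    offsets = []
    with open(pgn_path, "rb") as handle:
        pos = handle.tell()
        for line in iter(handle.readline, b""):
            if line.startswith(b"[Event "):
                offsets.append(pos)
            pos = handle.tell()
    return offsets

def read_game_text(pgn_path, offsets, n):
    with open(pgn_path, "rb") as handle:
        handle.seek(offsets[n])
        end = offsets[n + 1] if n + 1 < len(offsets) else None
        size = (end - offsets[n]) if end is not None else -1
        return handle.read(size).decode("utf-8", errors="replace")

# offsets = build_offset_index("blitzabase2023.pgn")
# print(read_game_text("blitzabase2023.pgn", offsets, 12345))
```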
Thank you for sharing relevant links on database issues and further information sources... Appreciated, for myself and for other onlookers of this thread.
I will hopefully have something more tabular later about conversion timings and sizes. Possibly someone could use my 7 GB test case (link somewhere above) for the CB part, instead of my half-baked Reader experiments.
I could maybe accept more heat wear and tear on the CPU to avoid CBR hanging, with no throttling. I am not throttling that much, and it's the same treatment for all of them; my laptop is the thing I want to keep going (Sandy Bridge Intel, also an i5, which did not handle heat well even back then; re-pasted once, a pain in the neck, not doing that again).
I'll note again that my question is about exploring software limits as a single user who wants the possibility of doing database operations on population-level chess databases. The information we can get from that is likely still useful for other goals users might have with chess databases, but I find it helpful to clarify my slant.
We ain't pros, and this is why this thread is useful: we can talk to other non-pros... and hopefully the pros might feel generous and adjust anything that screams too loudly when they read it here. As a non-pro, I find this informative so far, if one doesn't mind the conversational, journalling, casual style (some might conjure the word rambling... or out-of-norm texting, or keyboard superiority complex, or ten-finger gloating).