Chest compression game
I'm interested in chess and compression, so the compression of chess games is a natural thing to nerd out on. The PGN format and the algebraic notation it contains is human readable but spaceinefficient, I'm noodling around trying to minimise the footprint of the moves from a set of games. As this is just for fun only standard complete games are considered, and metadata is ignored as that goes into normal compression which is not the point of the exercise.
Here's the representations so far: [LIST][*]SG4  Standard game representation format used in SI4 for comparison, the database format of scid[*]PGN  Standard PGN format[*]ANA  Newline delimited algebraic notation, PGN without the metadata for a fairer comparison[*]MLA  Move list array  Games are represented as a list of moves where each move is two bytes (source and destination, the upper two bits of destination denote promotion target if promoting)[*]ILA  Index list array  At each position all valid moves are generated, and the index of the move is stored, allowing a single byte to represent the move[*]IPA  Packed index list array  A list of indexes where each index only takes up as many bits as required instead of always a byte. For example if there are only 13 valid moves in a position only 4 bits are required to store the index.[/LIST] Here's comparisons using a corpus of 4096 random games (random as in a standard game where every valid move is chosen at random, ending decisively or in a draw by the 50 move rule), where PGN metadata is minimal boilerplate to pass scid validation: [code]Format Filesize %ofANA XZsize %ofANA.xz SG4 1722293 17.9% 1289764 44.5% PGN 9996400 103.9% 2907876 100.3% ANA 9623664 100.0% 2897808 100.0% MLA 3321282 34.5% 2139160 73.8% ILA 1664735 17.3% 1015896 35.1% IPA 915582 9.5% 910660 31.4%[/code]There's a potential optimisation to be made in the ordering of valid moves in a position, if more likely valid moves are listed earlier then chessunaware compressors can perform better as it skews index frequency (assuming real games not random). The proper way to do this would be to use stockfish, probably more trouble than it's worth for noodling but potentially a strong step albeit computationally expensive. None of the above representations consider redundancy across games (SG4 unknown), openings being the low hanging fruit (again, for real nonrandom games). There's a few obvious possibilities but the one that seems most suitable is to to store a serialised tree structure efficiently encoding the openings, not the entirety of all games but enough to encode the easy redundancy without incurring too much overhead. Above is pretty much everything considered so far, there may be avenues or existing solutions I'm oblivious to which would be interesting to learn about. How can it be refined further or differently? 
Perhaps adding Huffman encoding on top of IPA can show some improvement.
And one step further is full algebraic encoding as is used for compression in space probes where every bit sent is expensive. 
I'm not familiar with huffman coding with input symbols of variable width, it has potential particularly when smart ordering is implemented. On the other hand when smart ordering is implemented the ILA representation also benefits, a real compressor can probably benefit more than whatever I can cook up even given the headstart of a packed representation. There is something that might tip the scales in huffmanencoded IPA's favour, and that's the fact that the IPA representation is not optimal. On average a packed index is something like half a bit inefficient, in the example above 4 bits to represent 13 states is wasting 3 states. I was considering using the remaining states to encode multiple moves in a single go, the most viable followups to the most viable current moves, but I don't think the current state of my cranklevelspaghetticode is up to the task. Huffman or similar might work and would be simpler.
A rudimentary ordering might work well enough that stockfish and all the drawbacks associated with using it isn't worth the effort. The ordering could be something like: [LIST][*]Castling, when available it's normally done pretty quickly[*]Taking a piece with no recourse[*]Taking a higher value piece with recourse[*]Taking an equal value piece with recourse[*]Moving a piece to a safe square[*]Moving a piece to a square covered by the opponent[*]Taking a lower value piece with recourse, sacrifice plays are relatively rare[/LIST] That should give a compressor some nicer value distributions to chew on. I'm open to suggestions for refinement, this is just easily implementable as cover masks have already been implemented for validating moves. 
Switched to Caissabase for testing, a database of over 4 million real games with strong players. To do that PGN ingest is implemented, although it's far from robust it can import all games from Caissabase excluding FEN, variants, the odd games with  denoting an unknown move and the 5 games with illegal castling moves. It's apparent that the engine needs some optimising to handle this many games quick enough for the current formats let alone the next formats to try, regardless here's rough preliminary data:
[code] Format Size(MB) %ofANA XzSize %ofANA.xz SG4 508 26.7% 277 60.5% PGN 2946 155.0% 610 133.2% ANA 1901 100.0% 458 100.0% MLA 685 36.0% 329 71.8% ILA 307 16.1% 150 32.8% IPA 234 12.3% 193 42.1% [/code]xz compressed ILA beats xz compressed IPA showing that IPA is far from optimal. The decent compression of IPA by xz also highlights that the IPA representation needs work. The file format of ILA has been reworked into a Cstring instead of the former representation which used a short as a move count. TODO: [LIST][*]Optimise masks[*]Smart ordering as described above, doable once masks optimised[*]Tree format to deduplicate openings. This is largely done just waiting on some other components[*]FEN? Setting board state to/from FEN is trivial, the part that needs some thought is how to encode it into an archive format[*]Chess960? If FEN is supported it's probably not much effort to support this mode, just some reworking of the castling rules which is the only thing different from a standard game AFAIK[*]Metadata preprocessor. There's no point reinventing standard compression, but a simple preprocessor that greatly reduces the input to a standard compressor could make sense. Encode tags, group related strings, etc[/LIST] 
[QUOTE=M344587487;570138][*]IPA  Packed index list array  A list of indexes where each index only takes up as many bits as required instead of always a byte. For example if there are only 13 valid moves in a position only 4 bits are required to store the index.[/LIST][/QUOTE]
You can reach almost the optimal log(13)/log(2) bits difficulty when you have 13 possibilities for a move. The idea is that for each p number of possibility maintain a seperate array to extract the moves. Say for 13 possibilities: [CODE] unsigned int tab13[100000]; [/CODE] Since 13^121<(2^32)^14 in only 14 ints you can encode 121 chess moves where in each move you have only 13 different valid chess moves. So you need (32*14)/121=3.7025 bits and not 4 bits. Ofcourse going higher you can get a closer distance to the optimal log(13)/log(2)=3.7004 bits, and optionally you can use only 1D array to compress all these info into a single array instead of using large number of tab2,tab3,tab4,..,tab13,tab14,.. arrays. 
Even easier way if you want to encode only a single/few game:
use a generalized number system, say in the ith move you have only B[i] valid moves you have chosen the r[i]th move from these. And the game had L move then define x=r[0]+r[1]*B[0]+r[2]*B[0]*B[1]+... from this number you can easily extract the r[i] numbers: r[0]=x mod B[0] (and doesn't depend on future moves r[1],r[2],..) r[1]=(xr[0])/B[0] mod B[1] etc. with this you would need to use ceil(log2(B[0])+log2(B[1])+..+log2(B[L1])) bits of memory, hence if you are using this encoding for each game then you'd lose at most 1 bit per game, in average 0.5 bit. 
Thank you, the generalised number solution is nice I'll implement it with GMP. If smart ordering is assumed a standard compressor compressing the ILA form should beat it, but it by far beats the IPA representation. If handling each game individually I think it can be archived as a length byte followed by the number stored in <length> bytes; 99% of games in Caissabase can fit in this form, the remaining few hundred can be encoded with the length as a varint, 0xFF + u16 as the length. The other option is to lean in to the maximal compression goal and encode all games into a single giant number to avoid the ~12 bits of overhead per game (~6MB for Caissabase). That's a lot of mult/div on the multihundred megabyte number that would be Caissabase, although they're all divisions by small numbers so maybe it's viable.
To be uniquely decodable indexed representations need to be altered to encode the end of the game as an option within the index itself. This should be more efficient than more overhead in the form of a turn count per game in the header. 
Yes, forget that above we need to store also the number of bits of x. Here 15 or 16 bits should be enough maybe for all games.

Here's another test with 200,000 random games. ~23.2 million moves (200k of those "end of game") with ~115 moves per game. The random game generator works (slightly) more realistically, king vs king waltz's are now not the norm. The General Number System representation is implemented per game as GNA. ISA is a refined ILA, a Cstring representation of a list of indexes with \0 denoting end of game instead of a turn count. MLA is gone (as a format, it's still heavily used in the engine), replaced with what I'm confusingly now calling ILA. ILA is now a 2 byte per move index + possiblemovecount format that allows for quick conversion to the indexed formats, aka everything except PGN and ANA.
[code] Form Size %ofANA bitspermove  XZSize %ofANA.xz ANA 127480406 100.0% 44.20  41050624 100.0% ISA 23274407 18.3% 8.07  15659384 38.1% IPA 15748818 12.4% 5.46  15543648 37.9% GNA 13948569 10.9% 4.84  13949324 34.0% [/code]GNA is incompressible which is a good sign. For now the overhead is implemented as varint and the numbers byte aligned because it was easy for ~12 bits of overhead per game. As order isn't used a better solution would be to order the games by GNS bit count, put all of the counts together to be easily compressed by diff+RLE (or external compressor) followed by the numbers unaligned. As bit count is stored we could omit the leading set bit per game, 1 bit per game doesn't sound like much but it's 25KB in this test and ~0.5MB for Caissabase. End of game has been encoded but not what the result was, that can be rolled into the GNS for ease at the cost of two bits although in the header it could potentially be huffman encoded to squeeze out a little more juice. The maximum length of x assuming the 50 move rule is adhered to can probably fit in 16 bits (~6000 moves from both players at over 5 bits per move, which is a very generous bpm). Probably ~99.998% of Caissabase has a length of x that can fit within 11 bits, the ~70 remaining games might need 12 bits but I haven't tested it yet. 
Sort version 0 aka no sort, just the natural order from the valid move generator which iterates over the players pieces. Not random but not good.
[code] Form Size %ofANA XZSize %ofANA.xz ANA 1868403142 100.0% 442580176 100.0% ISA 342966732 18.4% 188721016 42.6% IPA 242932132 13.0% 199165684 45.0% GNA 214021318 11.5% 211957748 47.9% [/code]Sort Version 1, which sorts in this order: Kingside castle, Queenside castle, value of a move considering the opponents immediate response. The value is calculated by adding the value of a taken piece, and subtracting the value of the players piece if it can potentially be taken. Not particularly smart but not expensive to calculate, shifts obvious good and bad material moves up and down respectively. [code] Form Size %ofANA XZSize %ofANA.xz ANA 1868403142 100.0% 442580176 100.0% ISA 342966732 18.4% 174789504 39.5% IPA 242932132 13.0% 189283260 42.8% GNA 214005548 11.5% 211950820 47.9% [/code][LIST][*]GNA is slightly compressible relative to the random game test thanks I think to some of the games in Caissabase being repeated[*]GNA with sort v1 is slightly smaller than v0 thanks to generally having smaller indexes leading to sometimes using 1 less bit per game[*]When given a large amount of real games IPA beats GNA as IPA is surprisingly compressible. Each game is byte aligned which may explain some of it as opening redundancy is available[*]Sort v1 is better than anticipated, even the IPA.xz version saved ~10MB over v0 but the ISA.xz was the real winner at ~14MB saved[/LIST] Potential refinement for sort v2 or beyond: [LIST][*]Checkmate seems obvious but expensive to calculate as the opponents next moves need to be generated to test for it. Also by definition not the most lucrative thing to check for[*]Checks on uncovered squares, particularly when the opponent can't respond by blocking with an immediate threat[*]Detecting forks from squares, particularly when the worst material trade is favourable or if a check is involved[*]Even trades when ahead on material is a common tactic, conversely a player is likely to be tradeaverse when down on material[*]Instead of just considering the immediate response, evaluate a trade chain to its logical conclusion. This can be done with individual piece cover masks to count how covered a square is instead of just using a single mask showing what the opponent has covered[/LIST] Hard to evaluate how some of these techniques stack up against each other or how they can sanely be mixed into one algorithm. 
Move sorting early in a game may be improved by pairing the compressor with an opening book.

All times are UTC. The time now is 18:14. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2021, Jelsoft Enterprises Ltd.