I'm trying to learn the Inflate algorithm by implementing it in Java so that I may then implement it in assembly for a CPU with a very restricted instruction set.
I'm having trouble reading in the correct code lengths from the file after reading in the number of literal, distance, and length codes.
I'm following through an implementation as described here, where they provide an example .gz file gunzip.c.gz. After reading in the gzip headers, the following is the first 56 bits of the first (and if I'm reading it correctly, only) compressed data block:
10111101 00011011 11111101 01101111 11011010 11001000 11110010
I will refer to bits in a byte by the following offsets: [76543210]
The first byte 10111101
contains the end of block
at offset 0
, offset 21
contain the two bits that determine the block type.
In this case, this is the last block, and it is a dynamic Huffman Tree.
The following bits to be interpreted are the number of literal (5 bits), distance (5 bits), and length (4 bits) codes. They are read in as follows:
10111101 -> 10111XXX -> 23 (literal)
00011011 -> XXX11011 -> 27 (distance)
00011011 11111101 -> 000XXXXX XXXXXXX1 -> 1000 -> 8 (length)
After this point, I'm having trouble. The next 36 bits should be 12 sets of 3 bits that indicate the code lengths.
From the bytes above, I see (interpreted as little-endian):
110 111 111 011 011 010 011 011 100 100 101 100
3 7 7 6 6 2 6 6 1 1 5 1
But I expect (as indicated in the link above)
101 011 011 110 110 110 110 110 110 001 001 001
5 6 6 3 3 3 3 3 3 4 4 4
I can't see any way to get these values from the file. I must be misinterpreting how the bits should be read. Given a byte with a 5 bit value followed by a 3 bit value, [CBA43210]
I would read it as 01234
and ABC
.
Is the article wrong in what the correct code lengths should be, or more likely, am I wrong on how they should be interpreted?