gzip: Interpreting bits for decompression

https://stackoverflow.com/questions/13678288

04-12-2021
|

Frage

I'm trying to learn the Inflate algorithm by implementing it in Java so that I may then implement it in assembly for a CPU with a very restricted instruction set.

I'm having trouble reading in the correct code lengths from the file after reading in the number of literal, distance, and length codes. I'm following through an implementation as described here, where they provide an example .gz file gunzip.c.gz. After reading in the gzip headers, the following is the first 56 bits of the first (and if I'm reading it correctly, only) compressed data block:

10111101 00011011 11111101 01101111 11011010 11001000 11110010

I will refer to bits in a byte by the following offsets: [76543210] The first byte 10111101 contains the end of block at offset 0, offset 21 contain the two bits that determine the block type.
In this case, this is the last block, and it is a dynamic Huffman Tree.

The following bits to be interpreted are the number of literal (5 bits), distance (5 bits), and length (4 bits) codes. They are read in as follows:

10111101 -> 10111XXX -> 23 (literal)
00011011 -> XXX11011 -> 27 (distance)
00011011 11111101 -> 000XXXXX XXXXXXX1 -> 1000 -> 8 (length)

After this point, I'm having trouble. The next 36 bits should be 12 sets of 3 bits that indicate the code lengths.
From the bytes above, I see (interpreted as little-endian):

110 111 111 011 011 010 011 011 100 100 101 100
3   7   7   6   6   2   6   6   1   1   5   1

But I expect (as indicated in the link above)

101 011 011 110 110 110 110 110 110 001 001 001
5   6   6   3   3   3   3   3   3   4   4   4

I can't see any way to get these values from the file. I must be misinterpreting how the bits should be read. Given a byte with a 5 bit value followed by a 3 bit value, [CBA43210] I would read it as 01234 and ABC.

Is the article wrong in what the correct code lengths should be, or more likely, am I wrong on how they should be interpreted?

Lösung

bits themselves are read LSB to MSB as illustrated in figure 6:
<----
87654321

(from the document linked by the question) indicates that

[CBA43210] I would read it as 01234 and ABC.

(from the question itself) is correct.

Little-endian means that least significant digits (here, bits) are stored or read (here, interpreted) first.

However, 01234 and ABC themselves are in little-endian, so the byte 110 11000 would be interpreted as 3 3, not 24 6.

This means the five, five and four bits in

10111101 00011011 ...

are

xxx11101 => 11101  => 29
101xxxxx
xxxxxx11 => 11 101 => 29
xx0110xx => 0110   => 6

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow