Computing entropy for data compression
29-10-2019
Question
I am a little confused about how they calculate the "average number of bits per symbol". Is this calculated by taking the probability of each character and multiplying it by lg(1/probability), as in the usual entropy formula, or some other way?
Also, if this is true, how do they know for sure what the average occurrence of a letter is?
Solution
I really shouldn't answer this because I don't know much about compression, but I can say:
- How is "bits per symbol" defined?
You are correct; it's the regular Shannon entropy, defined as -Σ p·log₂(p). Note, however, that the p here is the probability of a message, not the frequency of a character. For example, the following set of messages
{ abcdefghijklmnopqrstuvwxyz }
looks great analyzed letter by letter, but as a set containing a single message it has an entropy of 0.
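As a rough sketch (not from the original answer), the difference between the two viewpoints can be shown with a few lines of standard-library Python. The `entropy` helper below is an illustrative name, not an established API:

```python
from collections import Counter
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

msg = "abcdefghijklmnopqrstuvwxyz"

# Character-level view: 26 equally likely symbols -> log2(26) ~ 4.70 bits/symbol
counts = Counter(msg)
char_probs = [c / len(msg) for c in counts.values()]
print(entropy(char_probs))

# Message-level view: the set contains one message with probability 1 -> 0 bits
print(entropy([1.0]))
```

The letter-by-letter figure suggests almost five bits per symbol, yet a receiver who knows the message set needs zero bits, because there is nothing to choose between.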
- How can you know what the average occurrence of a letter is?
It is theoretically impossible to know for sure, unless you know the exact process by which the messages are generated. You have to use some heuristic: take a large sample and count frequencies, or look for patterns that you know are signs of redundancy, such as the statistics of English text.
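The counting heuristic mentioned above can be sketched in a few lines. This is an illustrative snippet, not the method from any particular compressor; `bits_per_symbol` is a hypothetical helper name:

```python
from collections import Counter
from math import log2

def bits_per_symbol(sample: str) -> float:
    """Estimate average bits/symbol from symbol frequencies in a sample."""
    counts = Counter(sample)
    n = len(sample)
    # Empirical entropy: -sum(p * log2(p)) over observed frequencies
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A skewed sample needs fewer bits per symbol than a uniform one:
print(bits_per_symbol("aaaaaaab"))  # well under 1 bit/symbol
print(bits_per_symbol("abcdefgh"))  # exactly 3.0 (8 equally likely symbols)
```

The larger and more representative the sample, the closer this estimate gets to the true source entropy, but it can never account for structure the counter does not model (e.g. inter-symbol dependencies).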