Stanford EE274: Data Compression | 2023 | Lecture 4 - Huffman Codes


Huffman Coding

• Huffman codes are optimal prefix codes and are the focus of this lecture.
• Huffman coding is a greedy algorithm for constructing a prefix code for a given set of probabilities.
• The algorithm works by repeatedly merging the two nodes with the smallest probabilities until only one node is left.
• The expected code length of a Huffman code is between the entropy of the source and the entropy plus 1.
• Huffman coding can be implemented efficiently using a heap or priority queue.
• Table-based decoding is a low-level technique used in compression algorithms, especially for large volumes of data such as videos and logs.
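The greedy merge-and-heap construction described above can be sketched as follows. This is a minimal illustration, not the lecture's reference implementation; the function name `huffman_code` and the dict-based representation are choices made here for clarity.

```python
import heapq
from typing import Dict

def huffman_code(probs: Dict[str, float]) -> Dict[str, str]:
    """Build a Huffman prefix code by greedily merging the two
    least-probable nodes until one node remains."""
    # Heap entries: (probability, unique tiebreaker, {symbol: partial codeword}).
    # The tiebreaker prevents Python from comparing the dicts on ties.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # smallest probability
        p2, _, c2 = heapq.heappop(heap)  # second smallest
        # Prepend '0' to one subtree's codewords and '1' to the other's.
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

# For a dyadic distribution the expected length equals the entropy (1.75 bits here).
code = huffman_code({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125})
```

Because every probability here is a power of 1/2, the resulting lengths (1, 2, 3, 3) match -log2 p exactly; for general distributions the expected length falls between H(X) and H(X) + 1, as noted above.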

Entropy and Related Concepts

• Recap of previous lecture:
• Kraft's inequality: A necessary and sufficient condition on codeword lengths for the existence of a prefix code.
• Entropy: A fundamental quantity for lossless compression, defined as the expected number of bits per symbol.
• Joint entropy for IID variables: Equals n times the per-symbol entropy, computed in two steps — independence makes the joint entropy a sum of individual entropies, and identical distribution makes every term in that sum equal.
• KL (Kullback-Leibler) Divergence: A distance-like measure between two probability distributions; it is greater than or equal to zero, with equality if and only if the distributions are equal.
• Main theorem of compression: For any prefix code, the expected code length is lower bounded by the entropy.
• Shannon codes: Not optimal, but can achieve entropy or get arbitrarily close to it when working with blocks.
• Quiz review:
• Entropy of a Bernoulli random variable:
• H(X) = -p log p - (1-p) log (1-p)
• H(0) = H(1) = 0
• H(1/2) = 1
• KL Divergence between two Bernoulli distributions:
• D(p || q) = p log (p/q) + (1-p) log ((1-p)/(1-q))
• The KL Divergence D(P || Q) can be infinite when P assigns positive probability to an outcome that Q assigns zero probability (e.g., Bernoulli parameters p = 1 and q = 0).
• Shannon codes are not optimal for highly skewed distributions, resulting in an expected code length that is much larger than the entropy.
• Block coding can improve the efficiency of Shannon codes by grouping symbols together and assigning shorter codewords to more probable pairs.
• The optimal prefix code for a given distribution can be found by satisfying two conditions:
• Symbols with higher probabilities should have shorter or equal length codewords compared to symbols with lower probabilities.
• The two longest codewords should have the same length.
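The quiz formulas above can be checked numerically. The sketch below, with function names chosen here for illustration, computes the Bernoulli entropy and KL divergence, and also shows the Shannon code length ceil(log2(1/p)) overshooting the entropy for a skewed source.

```python
import math

def bernoulli_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bernoulli_kl(p: float, q: float) -> float:
    """D(p || q) = p log2 (p/q) + (1-p) log2 ((1-p)/(1-q)).
    Infinite when p puts mass where q puts none."""
    if p == q:
        return 0.0
    if q in (0.0, 1.0):
        return math.inf  # p != q here, so P has mass where Q has none
    if p == 0.0:
        return math.log2(1 / (1 - q))
    if p == 1.0:
        return math.log2(1 / q)
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

# Shannon code lengths ceil(log2(1/p)) are wasteful for skewed sources:
p = 0.99
expected_len = (p * math.ceil(math.log2(1 / p))
                + (1 - p) * math.ceil(math.log2(1 / (1 - p))))
# expected_len is about 1.06 bits/symbol, while H(0.99) is only about 0.08 bits.
```

This mirrors the points above: H(1/2) = 1, H(0) = H(1) = 0, D(p || q) = 0 iff p = q, and a Shannon code for a highly skewed Bernoulli source spends far more than the entropy per symbol.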

Practical Considerations and Examples

• A source whose probabilities follow the Fibonacci sequence is an adversarial case for Huffman coding, producing long codewords and a maximum tree depth close to the number of symbols.
• A practical example of Huffman coding is shown using the frequency of letters on a keyboard, where commonly used letters have shorter code words.
• Huffman coding was used to compress various files, including a book, a Java library, and an encrypted file.
• The compression rate for the book was 44%, and for the Java library, it was also around 40%.
• The encrypted file, however, could not be compressed because encryption algorithms typically output uniformly distributed data, making it appear random to compressors.
• Compressing already-encrypted data is therefore ineffective: encryption destroys the statistical structure a compressor exploits, so the compression rate is one (no compression).
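The Fibonacci worst case mentioned above is easy to reproduce. The sketch below (with `huffman_lengths` as an illustrative helper that tracks only codeword depths during the greedy merge) shows that with 8 Fibonacci-weighted symbols the deepest codeword reaches length 7, i.e. n - 1.

```python
import heapq

def huffman_lengths(probs):
    """Run the greedy Huffman merge, tracking only (probability, depths),
    and return the sorted codeword lengths."""
    heap = [(p, i, [0]) for i, p in enumerate(probs)]  # each symbol starts at depth 0
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        heapq.heappush(heap, (p1 + p2, tie, [d + 1 for d in d1 + d2]))
        tie += 1
    return sorted(heap[0][2])

# Fibonacci weights 1, 1, 2, 3, 5, 8, 13, 21, normalized to probabilities:
fib = [1, 1, 2, 3, 5, 8, 13, 21]
lengths = huffman_lengths([f / sum(fib) for f in fib])
# The two rarest symbols end up at depth 7 = n - 1: a maximally skewed tree.
```

The merges degenerate into a chain because each partial sum of Fibonacci numbers stays below the next weight, so the tree grows by one level per merge instead of staying balanced.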

Next Lecture

• The next lecture will cover theoretical concepts such as entropy and block coding, exploring their importance and their interaction with probability and the law of large numbers.
• Professor Tsachy will lead the next lecture, followed by discussions of practical block and stream codes used in state-of-the-art compressors.