# Stanford EE274: Data Compression | 2023 | Lecture 8 - Beyond IID distributions: Conditional entropy

18 Apr 2024

## Entropy Coding

• Huffman coding is fast and optimal for single symbols or small blocks but has exponential complexity for larger blocks.
• Arithmetic coding achieves entropy efficiently in linear time but involves expensive division operations.
• ANS (Asymmetric Numeral Systems) variants, rANS and tANS, are faster than arithmetic coding and compress better than Huffman coding.
• rANS decodes faster than arithmetic coding, making it suitable for applications where decoding speed is critical.
• When the application requires the best compression ratio and is willing to sacrifice encoding or decoding speeds, arithmetic coding is the best choice.
• If extremely fast encoding and decoding are required, Huffman coding is the best option.
• For adaptive coding, where the probability model is updated as symbols are processed, arithmetic coding is the best choice.
• To achieve close-to-optimal compression with good speed, arithmetic coding or range coding can be used.
• Range coding exploits the parallelism of modern processors for faster encoding and decoding.
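As a quick numeric illustration of the Huffman bullet above (a minimal sketch using an arbitrary example distribution, not code from the lecture): Huffman coding is optimal for single symbols, and when the probabilities are dyadic (powers of 1/2) its expected code length exactly matches the entropy.

```python
import heapq
from math import log2

def huffman_code_lengths(probs):
    """Build a Huffman tree and return each symbol's code length (depth)."""
    # Heap items: (probability, unique tiebreak, list of (symbol, depth)).
    heap = [(p, i, [(s, 0)]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)          # two least-probable nodes
        p2, _, b = heapq.heappop(heap)
        merged = [(s, d + 1) for s, d in a + b]  # merging adds one bit of depth
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return dict(heap[0][2])

# Example dyadic distribution (hypothetical, for illustration only).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
avg_len = sum(probs[s] * lengths[s] for s in probs)  # expected bits/symbol
entropy = -sum(p * log2(p) for p in probs.values())  # Shannon entropy

print(avg_len, entropy)  # both 1.75: Huffman meets entropy for dyadic probs
```

For non-dyadic distributions the Huffman expected length exceeds the entropy by up to 1 bit per symbol, which is why blocking (or arithmetic coding) is needed to close the gap.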

## Non-IID Data

• Entropy coding assumes IID (independent and identically distributed) data, but real-life data is often non-IID.
• Non-IID data has dependencies between symbols, such as the probability of a symbol being influenced by previous symbols.
• Concatenating files with different distributions results in a mixed entropy that is not optimal for either file.
• A Markov chain is a stochastic process in which the probability of the next symbol depends only on the current symbol, not on any earlier ones.
• A stationary stochastic process is one whose joint distribution is invariant to shifts in time, so its statistics do not change over time.
• IID (independent and identically distributed) processes are a special case of stationary processes in which all symbols are independent and share the same distribution.
• Marginalization is the process of summing a joint distribution over all values of one random variable to obtain the probability distribution of the remaining variable.
• A first-order Markov chain is one in which the next state depends only on the single most recent state; a K-order Markov chain is one in which it depends only on the previous K states.
• A stationary Markov chain is one in which the distribution over states does not change over time.
• A necessary condition for stationarity is that the mean does not change over time; a process whose mean drifts is non-stationary.
• Some non-stationary processes, such as a random walk, can be converted into stationary ones by taking the difference between consecutive values.
• Conditional entropy of U given V is defined as the average uncertainty in U given that V is known.
• A particular observed value of V can increase or decrease the remaining uncertainty in U, but on average, knowing V never increases it.
• The chain rule of entropies: the joint entropy of U and V equals the entropy of U plus the conditional entropy of V given U, i.e., H(U,V) = H(U) + H(V|U).
• Conditioning reduces entropy: H(U|V) ≤ H(U), with equality if and only if U and V are independent.
• In terms of compression, the best way to jointly compress U and V is to first compress U on its own and then compress V given the knowledge of U.
• The joint entropy of U and V is at most the sum of their individual entropies: H(U,V) ≤ H(U) + H(V), with equality if and only if U and V are independent.
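These entropy identities are easy to verify numerically. The following is a minimal sketch (the joint distribution is a made-up example, not one from the lecture) checking the chain rule and that conditioning reduces entropy on average:

```python
from math import log2

# Hypothetical joint distribution p(u, v) over U in {0,1}, V in {0,1}.
p_uv = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginals via marginalization (summing out the other variable).
p_u = {u: sum(p for (uu, _), p in p_uv.items() if uu == u) for u in (0, 1)}
p_v = {v: sum(p for (_, vv), p in p_uv.items() if vv == v) for v in (0, 1)}

# Conditional entropy H(V|U) = sum_u p(u) * H(V | U=u).
H_v_given_u = sum(
    p_u[u] * H({v: p_uv[(u, v)] / p_u[u] for v in (0, 1)}) for u in (0, 1)
)

H_joint = H(p_uv)
print(abs(H_joint - (H(p_u) + H_v_given_u)) < 1e-9)  # chain rule holds
print(H_v_given_u <= H(p_v))                         # conditioning reduces entropy
```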
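The differencing idea from the non-stationarity bullets can also be demonstrated with a short sketch (the data is simulated, not from the lecture): a random walk is non-stationary, but taking first differences recovers its stationary IID increments.

```python
import random
from itertools import accumulate

# Simulated data: IID +/-1 steps and their running sum (a random walk).
random.seed(0)
steps = [random.choice([-1, 1]) for _ in range(1000)]  # IID increments: stationary
walk = list(accumulate(steps))                         # random walk: non-stationary

# Differencing consecutive values of the walk recovers the IID steps.
diffs = [walk[0]] + [walk[i] - walk[i - 1] for i in range(1, len(walk))]
print(diffs == steps)  # True: differencing undoes the accumulation
```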

## Entropy Rate

• The entropy rate of a stationary process is the limit, as the block length grows, of the minimum average number of bits per symbol required to compress the process.
• For IID sequences, the entropy rate is equal to the entropy of each symbol.
• For a K-order Markov process, the entropy rate equals the conditional entropy of the next symbol given the previous K symbols, H(U_{K+1} | U_1, ..., U_K).
• Estimates of the entropy rate of English text fall as the model captures more dependence: 4.76 bits/letter (IID model), 4.03 bits/letter (first-order Markov), 2.8 bits/letter (fourth-order Markov), and about 1.3 bits/letter using a human predictor.
• The best achievable compression rate for a stationary first-order Markov source is the conditional entropy H(U_2 | U_1).
• For a general stationary source, the best compression is the entropy rate.
• The Shannon-McMillan theorem states that, for a stationary ergodic source, the probability of a typical block of length n is roughly 2^(-n·H), where H is the entropy rate.
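To make the Markov entropy-rate bullets concrete, here is a minimal sketch (the two-state transition matrix is a hypothetical example) that computes the entropy rate H(U_2 | U_1) of a stationary first-order Markov chain and compares it with the entropy of the marginal distribution, i.e., what an IID model of the same source would pay:

```python
from math import log2

def H(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical 2-state transition matrix: P[i][j] = Pr(next = j | current = i).
P = [[0.9, 0.1],
     [0.2, 0.8]]

# Stationary distribution pi solves pi = pi P; for a 2-state chain it is
# proportional to (P[1][0], P[0][1]).
a, b = P[0][1], P[1][0]
pi = [b / (a + b), a / (a + b)]  # here [2/3, 1/3]

entropy_rate = sum(pi[i] * H(P[i]) for i in range(2))  # H(U_2 | U_1)
iid_entropy = H(pi)                                    # entropy of the marginal

print(entropy_rate < iid_entropy)  # True: modeling the memory saves bits
```

The gap between `iid_entropy` and `entropy_rate` is exactly the per-symbol saving available to a compressor that conditions on the previous symbol instead of treating the source as IID.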