Stanford EE274: Data Compression I 2023 I Lecture 8 - Beyond IID distributions: Conditional entropy

18 Apr 2024 (over 1 year ago)

Entropy Coding

Huffman coding is fast and optimal for single symbols or small blocks but has exponential complexity for larger blocks.
Arithmetic coding achieves entropy efficiently in linear time but involves expensive division operations.
ANS (Asymmetric Numeral Systems) variants, RS and TS, are faster than arithmetic coding and better than Huffman coding.
RS has fast decoding compared to arithmetic coding, making it suitable for applications where decoding time is crucial.
When the application requires the best compression ratio and is willing to sacrifice encoding or decoding speeds, arithmetic coding is the best choice.
If extremely fast encoding and decoding are required, Huffman coding is the best option.
For adaptive decoding, arithmetic coding is the best choice.
To achieve close to optimal compression with good speed, arithmetic coding or range encoding can be used.
Range encoding exploits parallel processing on modern processors for faster encoding and decoding.

Non-IID Data

Entropy coding assumes IID (independent and identically distributed) data, but real-life data is often non-IID.
Non-IID data has dependencies between symbols, such as the probability of a symbol being influenced by previous symbols.
Concatenating files with different distributions results in a mixed entropy that is not optimal for either file.
Markov chains are stochastic processes where the probability of the next symbol depends only on the previous symbol.
A stationary stochastic process is a process where the probabilities do not change over time.
IID (independent and identically distributed) processes are a special case of stationary processes where all the symbols are independent and have the same distribution.
A Markov process is a stochastic process where the probability of the next symbol depends only on the previous symbol.
Marginalization is the process of summing over all the values of a random variable to get the probability distribution of another random variable.
A Markov chain is a stochastic process where the probability of the next state depends only on the current state, not on any previous states.
A stationary Markov chain is one where the probability distribution of the states does not change over time.
A first-order Markov chain is a Markov chain where the probability of the next state depends only on the current state, not on any previous states.
A K-order Markov chain is a Markov chain where the probability of the next state depends only on the previous K states.
A stationary process is one where the mean does not change over time.
A non-stationary process is one where the mean changes over time.
To convert a non-stationary process into a stationary process, one can take the difference between consecutive values.
Conditional entropy of U given V is defined as the average uncertainty in U given that V is known.
Knowing more about V can decrease or increase the entropy of U, but on average, more knowledge reduces uncertainty.
The joint entropy of U and V is equal to the entropy of U plus the entropy of V given U: H(U,V) = H(U) + H(V|U).
Conditioning reduces entropy, meaning that H(U|V) ≤ H(U), with equality if and only if U and V are independent.
The chain rule of entropies states that H(U,V) = H(U) + H(V|U).
In terms of compression, the best way to jointly compress U and V is to first compress U on its own and then compress V given the knowledge of U.
Joint entropy of random variables U and V is less than or equal to the sum of their individual entropies.

Entropy Rate

Entropy rate of a stationary process is defined as the limit of the average number of bits per symbol required to compress the process.
For IID sequences, the entropy rate is equal to the entropy of each symbol.
For K-order Markov processes, the entropy rate is equal to the entropy of the (K+1)th symbol given the first K symbols.
English text has an entropy rate of 4.76 bits per letter when modeled as an IID process, 4.03 bits per letter when modeled as a first-order Markov process, 2.8 bits per letter when modeled as a fourth-order Markov process, and 1.3 bits per letter when modeled using a human predictor.
The best compression for a first-order Markov source is H(U|U_1).
For a general stationary source, the best compression is the entropy rate.
The Shannon-McMillan theorem states that the probabilities of blocks are roughly equal to 2 to the power of minus the entropy rate.