# Stanford EE274: Data Compression I 2023 I Lecture 8 - Beyond IID distributions: Conditional entropy

## Entropy Coding

- Huffman coding is fast and optimal for single symbols or small blocks but has exponential complexity for larger blocks.
- Arithmetic coding achieves entropy efficiently in linear time but involves expensive division operations.
- ANS (Asymmetric Numeral Systems) variants, RS and TS, are faster than arithmetic coding and better than Huffman coding.
- RS has fast decoding compared to arithmetic coding, making it suitable for applications where decoding time is crucial.
- When the application requires the best compression ratio and is willing to sacrifice encoding or decoding speeds, arithmetic coding is the best choice.
- If extremely fast encoding and decoding are required, Huffman coding is the best option.
- For adaptive decoding, arithmetic coding is the best choice.
- To achieve close to optimal compression with good speed, arithmetic coding or range encoding can be used.
- Range encoding exploits parallel processing on modern processors for faster encoding and decoding.

## Non-IID Data

- Entropy coding assumes IID (independent and identically distributed) data, but real-life data is often non-IID.
- Non-IID data has dependencies between symbols, such as the probability of a symbol being influenced by previous symbols.
- Concatenating files with different distributions results in a mixed entropy that is not optimal for either file.
- Markov chains are stochastic processes where the probability of the next symbol depends only on the previous symbol.
- A stationary stochastic process is a process where the probabilities do not change over time.
- IID (independent and identically distributed) processes are a special case of stationary processes where all the symbols are independent and have the same distribution.
- A Markov process is a stochastic process where the probability of the next symbol depends only on the previous symbol.
- Marginalization is the process of summing over all the values of a random variable to get the probability distribution of another random variable.
- A Markov chain is a stochastic process where the probability of the next state depends only on the current state, not on any previous states.
- A stationary Markov chain is one where the probability distribution of the states does not change over time.
- A first-order Markov chain is a Markov chain where the probability of the next state depends only on the current state, not on any previous states.
- A K-order Markov chain is a Markov chain where the probability of the next state depends only on the previous K states.
- A stationary process is one where the mean does not change over time.
- A non-stationary process is one where the mean changes over time.
- To convert a non-stationary process into a stationary process, one can take the difference between consecutive values.
- Conditional entropy of U given V is defined as the average uncertainty in U given that V is known.
- Knowing more about V can decrease or increase the entropy of U, but on average, more knowledge reduces uncertainty.
- The joint entropy of U and V is equal to the entropy of U plus the entropy of V given U: H(U,V) = H(U) + H(V|U).
- Conditioning reduces entropy, meaning that H(U|V) ≤ H(U), with equality if and only if U and V are independent.
- The chain rule of entropies states that H(U,V) = H(U) + H(V|U).
- In terms of compression, the best way to jointly compress U and V is to first compress U on its own and then compress V given the knowledge of U.
- Joint entropy of random variables U and V is less than or equal to the sum of their individual entropies.

## Entropy Rate

- Entropy rate of a stationary process is defined as the limit of the average number of bits per symbol required to compress the process.
- For IID sequences, the entropy rate is equal to the entropy of each symbol.
- For K-order Markov processes, the entropy rate is equal to the entropy of the (K+1)th symbol given the first K symbols.
- English text has an entropy rate of 4.76 bits per letter when modeled as an IID process, 4.03 bits per letter when modeled as a first-order Markov process, 2.8 bits per letter when modeled as a fourth-order Markov process, and 1.3 bits per letter when modeled using a human predictor.
- The best compression for a first-order Markov source is H(U|U_1).
- For a general stationary source, the best compression is the entropy rate.
- The Shannon-McMillan theorem states that the probabilities of blocks are roughly equal to 2 to the power of minus the entropy rate.