Stanford EE274: Data Compression I 2023 I Lecture 5 - Asymptotic Equipartition Property

19 Apr 2024 (1 year ago)

Asymptotic Equipartition Property (AEP) and Compression

The speaker introduces the concept of the asymptotic equipartition property (AEP) and its relation to compression loss and fundamental limits on compression.
The AEP states that for large n, the typical sequences have a uniform distribution with probability approximately 2^(-nH), where H is the entropy of the source.
It is impossible to achieve lossless compression with fewer than H bits per source symbol.
A compressor cannot achieve lossless compression if the rate (R) is less than the entropy (H) of the source.
The probability of a source sequence falling within the set of sequences that a compressor can reconstruct is negligible if R is less than H.

The speaker provides an example of an IID sequence of binary data generated according to a Bernoulli parameter P.
The speaker explains that with high probability, the sequence will have approximately nP ones and n(1-P) zeros.
The speaker defines typical sequences as those sequences whose probability is approximately 2^n times the entropy of the source.
The typical sequences are a small subset of all possible sequences, with a size of approximately 2^nH, where n is the length of the sequence and H is the entropy of the source.
The entropy of a biased source is less than one, and the fraction of typical sequences that can be seen with non-negligible probability is exponentially small for large n.

Huffman codes are used in various common algorithms, including JPEG, for entropy coding.
Fixed Huffman codes are predefined and embedded in formats like gzip, avoiding the need to transmit probability distributions.
Canonical Huffman codes are sorted by codeword length and lexicographically for the same length.
Canonical Huffman codes only require the lengths of the codewords to reconstruct the codebook, eliminating the need to transmit probabilities or the tree structure.
Huffman codes are not necessarily optimal for every sequence realization, but they provide the minimum expected length for a given source.
The optimality of Huffman codes is defined in terms of expected length, not for every particular sequence.

Block codes can achieve better compression than Huffman codes for specific sequences, but Huffman codes are optimal in expectation.