Stanford EE274: Data Compression I 2023 I Lecture 10 - LZ and Universal Compression
18 Apr 2024 (8 months ago)
LZ77 Compression Algorithm
- LZ77 is a lossless data compression algorithm that achieves optimal results for any stationary source without prior knowledge of the source distribution.
- LZ77 works by dividing an input sequence into three streams: unmatched literals, match length, and match offset.
- LZ77 can efficiently encode short matches that occur far apart, saving bits compared to encoding the individual characters.
- LZ77 uses a sliding window to find matches within a limited scope, rather than keeping the entire past in memory.
- Match finding is the core of an LZ77 encoder, and it involves indexing past occurrences of sequences in a data structure like a hash table.
- LZ77 parsing aims to find the best match to reduce literals and improve compression.
- LZ77 converts a complex stationary distribution into a simpler IID distribution, allowing the use of entropy coders for efficient compression.
LZ77 Compression Techniques
- The greedy approach takes the first match it finds, while the lazy approach skips short matches in the hope of finding longer ones in the future.
- Increasing the minimum match length leads to more literals and fewer matches, as short matches are disallowed.
- The lazy strategy outperforms the greedy strategy for any minimum match length.
LZ77 Compressor Comparison
- Modern LZ77 compressors like Zstandard outperform older ones like gzip due to better algorithms, memory usage, and implementation techniques.
- LZ4 is the fastest compressor for both compression and decompression, but it provides less compression compared to other compressors.
- Zstandard is a modern alternative to gzip with better compression and decompression speed, but it uses more memory.
- For long-term archival purposes, compressors like XZ and bzip2 offer better compression but are slower.
- Advanced techniques like context-based coding and dictionary compression can achieve even higher compression ratios but are more resource-intensive.
LZ77 Compression Considerations
- Before compressing data, consider if all the data is necessary and if there are compressible parts.
- Think about the data access pattern and store data accordingly, such as using cheaper cold storage for less frequently accessed data.
- Lossy compression may be necessary for practical purposes, especially for data like videos.
- Avoid implementing your own compressor from scratch or buying an expensive license without first understanding the application and requirements.
- Understand the application, speed, memory usage, and location of compression and decompression before choosing a compressor.
- Consider the command-line interface and library availability of different compressors.
- The choice of compressor depends on the homogeneity of the data and the desired compression speed and ratio.
- When using Zstandard, it's important to choose the right compression level and parameters for optimal performance.
- Reusing the Zstandard context when compressing multiple files can speed up the process.
- Dictionary compression is useful for very small files where finding good matches within the file is unlikely.
- Zstandard has better compression and speed compared to gzip due to its improved Huffman coding, advanced match-finding strategies, and efficient implementation.
- Before creating a custom compressor, consider if the compression gap is significant, if the data has specific characteristics that can be exploited, if it's a closed ecosystem, and if there's enough data or a strong need for it.
- Creating a custom compressor involves designing, implementing, testing, and maintaining the compression format, ensuring backward compatibility, and supporting various platforms and architectures.
- Domain-specific compressors can be valuable in certain scenarios, such as genomics, but they may limit interoperability.
- If considering a domain-specific compressor, it's recommended to collaborate on a project to benefit from expertise and avoid fragmentation.