Stanford EE274: Data Compression I 2023 I Lecture 1 - Course Intro, Lossless Data Compression Basics

18 Apr 2024 (1 year ago)

Data Volume and Management

Data volume has grown exponentially, with zettabytes (ZB) as the current unit of measurement.
Managing this massive amount of data poses challenges in storage, data transfer, and various applications.

Videos contribute to 50% of internet data traffic, leading to bandwidth issues and requests to reduce streaming quality.
Data compression is valuable across various data types, including text, code, genomic data, emails, tweets, images, videos, and sensor data.
Compression plays a crucial role in machine learning models, enabling efficient execution on devices.

Compression involves trade-offs and design considerations to balance efficiency and data quality.
Data compression involves making design decisions to balance file size and error.
Audio and image examples illustrate the trade-offs in compression, with varying opinions on quality.
Compression aims for the succinct representation of information without losing its message.

Compression is essential due to storage costs, exponential data growth, signal denoising, communication efficiency, simplified implementations, and its relation to data modeling, prediction, and machine learning.
Compression is a critical building block for scalable and efficient systems.
Compression has applications beyond traditional usage, such as in brain-machine interfaces and eye prosthetics.

Compression connects to information theory, with Claude Shannon as the father of the field.
Bits and bytes are fundamentally connected to information.
There is a limit to how much data can be compressed, which is mathematically understood through the concept of entropy.

Lossless compression algorithms like LZ-based schemes and transforms in multimedia compression are widely used.
Numeral systems like binary are related to compression.

Clever implementations and hardware optimizations are crucial for efficient compression, such as caching computations and specific hardware for video decoding.
Data structures like suffix trees and BWT transforms were designed for compression and have broader applications, such as in genomics.

The course will cover both lossless and lossy compression, with a focus on entropy, various lossless compressors, rate distortion theory, quantization, transform coding, image compression (including JPEG, BPG, and learned compression), and video compression considering human perception.
A tool will be introduced to analyze properties of compressed files, such as JPEG images.
The course will cover lossless compression, information theory concepts, and algorithms.

Basic probability knowledge is required, including random variables, expectation, and conditional probability.

A simple example of text file compression is given, using a uniform distribution of four symbols (a, b, c, d).
The file size depends on the number of bits used per symbol, with 8 bits per symbol (1 byte) commonly used.
The concept of bits and bytes is explained, with 1 byte = 8 bits, 1 kilobyte = 1,000 bytes, and so on.
A warning is given about the potential confusion between powers of 2 and powers of 10 when dealing with large numbers.
The example file created using 1 byte per symbol is 1 megabyte in size.
ASCII encoding is introduced as a standard way to represent characters as numbers and hence as bits.
An example of using 2 bits per symbol instead of 8 bits is given, reducing the file size from 1 megabyte to 250 kilobytes.
Considerations for sharing a file compressed using a custom encoding scheme are mentioned, such as the need for the recipient to know the symbol representation.

The fixed bitwidth code uses the same number of bits for each symbol, which is optimal for a uniform distribution.
For a non-uniform distribution, the fixed bitwidth code is inefficient as it uses the same number of bits for less probable symbols.
Variable length codes assign fewer bits to more probable symbols and more bits to less probable symbols, resulting in a more efficient compression.
The decoding of variable length codes will be discussed later in the lecture.
The key takeaway is to use fewer bits for more probable symbols and more bits for less probable symbols to achieve efficient compression.

Expected code length is a measure used to evaluate codes by dividing the compressed size by the uncompressed size, providing an average number of bits used per input symbol.
The expected code length for a given code can be calculated by summing the product of each symbol's probability and its code length.
The code discussed in the video achieves a compression rate of 1.53 bits per symbol, which is an improvement over the original 2 bits per symbol.

The non-uniform distribution mentioned (49, 49, 0.1, 0.1) is similar to the previous distribution (a, b) and suggests that the optimal code for both should be similar.
The optimal compression rate for the given distribution is approximately 1.14 bits per base, which is not quite one bit per symbol but still an improvement.
The speaker emphasizes that this optimal rate is the best achievable with any scheme, even with infinite compute power or the involvement of highly intelligent individuals like Albert Einstein.

The homework assignment for the next lecture involves decoding a provided code, and the following lecture will discuss the decoding process and whether everyone arrives at the same decoded message.