# Stanford EE274: Data Compression (2023) | Lecture 1 - Course Intro, Lossless Data Compression Basics

## Data Volume and Management

- Data volume has grown exponentially, with zettabytes (ZB) as the current unit of measurement.
- Managing this massive amount of data poses challenges in storage, data transfer, and various applications.

## Data Compression Applications

- Videos account for roughly 50% of internet data traffic, leading to bandwidth pressure and requests to reduce streaming quality.
- Data compression is valuable across various data types, including text, code, genomic data, emails, tweets, images, videos, and sensor data.
- Compression plays a crucial role in machine learning models, enabling efficient execution on devices.

## Compression Design Considerations

- Data compression involves design decisions that trade off file size against reconstruction error.
- Audio and image examples illustrate these trade-offs, with listeners and viewers differing on what quality is acceptable.
- Compression aims for the succinct representation of information without losing its message.

## Reasons to Care about Compression

- Compression is essential due to storage costs, exponential data growth, signal denoising, communication efficiency, simplified implementations, and its relation to data modeling, prediction, and machine learning.
- Compression is a critical building block for scalable and efficient systems.
- Compression has applications beyond traditional usage, such as in brain-machine interfaces and eye prosthetics.

## Compression and Information Theory

- Compression connects to information theory, with Claude Shannon as the father of the field.
- Bits and bytes are fundamentally connected to information.
- There is a limit to how much data can be compressed, which is mathematically understood through the concept of entropy.
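The entropy limit mentioned above can be sketched numerically. This is a minimal illustration (not from the lecture) of Shannon's entropy formula, H(X) = -Σ p(x) log₂ p(x), which gives the fundamental limit in bits per symbol:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2(p(x))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform source over 4 symbols cannot be compressed below 2 bits/symbol;
# a deterministic source carries no information at all.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([1.0]))                     # 0.0
```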

## Lossless Compression Algorithms

- Widely used compression methods include LZ-based lossless schemes and the transforms used in multimedia compression.
- Numeral systems like binary are related to compression.

## Efficient Compression Techniques

- Clever implementations and hardware optimizations are crucial for efficient compression, such as caching computations and specific hardware for video decoding.
- Data structures like suffix trees and the Burrows-Wheeler transform (BWT) were designed for compression and have found broader applications, such as in genomics.

## Course Coverage

- The course will cover both lossless and lossy compression: entropy and information theory concepts, various lossless compressors, rate-distortion theory, quantization, transform coding, image compression (including JPEG, BPG, and learned compression), and video compression that accounts for human perception.
- A tool will be introduced to analyze properties of compressed files, such as JPEG images.

## Basic Probability Concepts

- Basic probability knowledge is required, including random variables, expectation, and conditional probability.

## Simple Text File Compression Example

- A simple example of text file compression is given, using a uniform distribution of four symbols (a, b, c, d).
- The file size depends on the number of bits used per symbol, with 8 bits per symbol (1 byte) commonly used.
- The concept of bits and bytes is explained, with 1 byte = 8 bits, 1 kilobyte = 1,000 bytes, and so on.
- A warning is given about the potential confusion between powers of 2 and powers of 10 when dealing with large numbers.
- The example file created using 1 byte per symbol is 1 megabyte in size.
- ASCII encoding is introduced as a standard way to represent characters as numbers and hence as bits.
- An example of using 2 bits per symbol instead of 8 bits is given, reducing the file size from 1 megabyte to 250 kilobytes.
- Considerations for sharing a file compressed using a custom encoding scheme are mentioned, such as the need for the recipient to know the symbol representation.
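The 1 MB → 250 KB reduction above can be sketched in a few lines. This is a hypothetical implementation (the symbol-to-bit mapping `CODE` is an assumption, not the lecture's exact code) of a 2-bit fixed-width encoding for the four symbols:

```python
# Hypothetical 2-bit fixed-width code for the four symbols a, b, c, d.
CODE = {'a': '00', 'b': '01', 'c': '10', 'd': '11'}

def encode(text):
    """Concatenate 2-bit codewords and pack the bits into bytes."""
    bitstring = ''.join(CODE[ch] for ch in text)
    # Pad with zeros to a whole number of bytes before packing.
    padded = bitstring.ljust(-(-len(bitstring) // 8) * 8, '0')
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

# 1,000,000 symbols: 1 MB at 1 byte/symbol (ASCII), 250 KB at 2 bits/symbol.
data = 'abcd' * 250_000
print(len(encode(data)))  # 250000
```

Note that the recipient still needs the `CODE` table to decode, which is exactly the sharing consideration mentioned above.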

## Fixed Bitwidth Code and Variable Length Codes

- The fixed bitwidth code uses the same number of bits for each symbol, which is optimal for a uniform distribution.
- For a non-uniform distribution, the fixed bitwidth code is inefficient as it uses the same number of bits for less probable symbols.
- Variable length codes assign fewer bits to more probable symbols and more bits to less probable symbols, resulting in a more efficient compression.
- The decoding of variable length codes will be discussed later in the lecture.
- The key takeaway is to use fewer bits for more probable symbols and more bits for less probable symbols to achieve efficient compression.
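As a concrete sketch (the codeword table `VARCODE` is an assumption, not necessarily the lecture's exact code): a variable-length code might assign 1 bit to the most probable symbol and 3 bits to the rarest ones. The lecture defers decoding, but this particular table happens to be prefix-free, so greedy left-to-right matching works:

```python
# Hypothetical prefix-free code: shorter codewords for more probable symbols.
VARCODE = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

def encode(text):
    return ''.join(VARCODE[ch] for ch in text)

def decode(bits):
    # No codeword is a prefix of another, so we can match greedily.
    inv = {v: k for k, v in VARCODE.items()}
    out, cur = [], ''
    for bit in bits:
        cur += bit
        if cur in inv:
            out.append(inv[cur])
            cur = ''
    return ''.join(out)

print(encode('abad'))     # '0100111'
print(decode('0100111'))  # 'abad'
```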

## Expected Code Length

- Expected code length is a measure used to evaluate codes: divide the compressed size in bits by the number of input symbols to get the average number of bits used per symbol.
- The expected code length for a given code can be calculated by summing the product of each symbol's probability and its code length.
- The code discussed in the lecture achieves 1.53 bits per symbol on average, an improvement over the fixed 2 bits per symbol.
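The 1.53 figure can be reproduced with the formula above, assuming codeword lengths (1, 2, 3, 3) on the skewed distribution (0.49, 0.49, 0.01, 0.01) (these specific values are an inference from the stated result, not spelled out in the notes):

```python
# Expected code length: L = sum_i p_i * l_i, in bits per symbol.
# Assumed lengths (1, 2, 3, 3) on probabilities (0.49, 0.49, 0.01, 0.01).
probs = [0.49, 0.49, 0.01, 0.01]
lengths = [1, 2, 3, 3]
L = sum(p * l for p, l in zip(probs, lengths))
print(round(L, 2))  # 1.53
```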

## Optimal Compression Rate

- The non-uniform distribution (0.49, 0.49, 0.01, 0.01) puts almost all of its mass on two symbols (a, b), suggesting that the optimal code should approach that of a two-symbol distribution, i.e., about 1 bit per symbol.
- The optimal compression rate for this distribution is approximately 1.14 bits per symbol, which is not quite one bit per symbol but still an improvement.
- The speaker emphasizes that this optimal rate is the best achievable with any scheme, even with infinite compute power or the involvement of highly intelligent individuals like Albert Einstein.
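The ~1.14 bits per symbol figure is the entropy of the distribution. As a quick check (assuming the distribution is (0.49, 0.49, 0.01, 0.01), inferred from the stated rates):

```python
import math

# Entropy of the skewed distribution: the fundamental limit for any scheme,
# no matter how much compute is available.
probs = [0.49, 0.49, 0.01, 0.01]
H = -sum(p * math.log2(p) for p in probs)
print(round(H, 2))  # 1.14
```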

## Homework Assignment

- The homework assignment for the next lecture involves decoding a provided code, and the following lecture will discuss the decoding process and whether everyone arrives at the same decoded message.