Stanford EE274: Data Compression I 2023 I Lecture 17 - Humans and Compression

Human Perception in Lossy Compression

  • Practical lossy compressors consider human sensory perception in their design.
  • MSE is not always the best distortion metric, and other perceptual metrics may be more appropriate.
  • The human eye acts as a low-pass filter, smoothing out high-frequency details.
  • The retina contains different types of retinal ganglion cells that independently tile the visual scene and send signals to the visual cortex.
  • Rod cells are responsible for encoding intensity and can adapt to a wide dynamic range of light levels.
  • Cone cells are responsible for color and details and are densely packed in a small central region of the retina called the fovea.
  • The eye constantly makes rapid saccadic movements, continually shifting which part of the physical scene is projected onto the fovea.
  • The brain interpolates the signals from the moving eye to form a continuous and smooth-looking picture.
  • Human eyes have three types of cones that respond to different wavelengths of light, which are roughly red, green, and blue.
  • The RGB color model is based on the response of these cones and is widely used in image and video processing.
  • The human visual system is more sensitive to luminance than to color, so color components can be downsampled more than luminance components in image compression without significantly affecting perceived image quality.
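The first two bullets above note that MSE can be a poor proxy for perceived quality. A minimal numpy sketch of why: a uniform brightness shift and random per-pixel noise can have exactly the same MSE, yet look very different to a human viewer (the values 64x64 and ±10 here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)

# Two reconstructions with identical MSE but very different appearance:
# (1) a uniform brightness shift, (2) random +/-10 noise of the same energy.
shifted = img + 10.0
noisy = img + rng.choice([-10.0, 10.0], size=img.shape)

def mse(a, b):
    return np.mean((a - b) ** 2)

print(mse(img, shifted))  # 100.0
print(mse(img, noisy))    # 100.0
```

Both reconstructions score identically under MSE, but the shifted image looks nearly unchanged while the noisy one looks visibly degraded.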

Color Spaces and Perception

  • There are other color spaces besides RGB, such as CMYK, which is used in printing.
  • Different color spaces can produce different reconstructions of the same image, which is important to consider in image and video compression.
  • The opponent process theory of color vision suggests that there are pairs of colors that are negatives of each other.
  • The YCbCr color space is motivated by the opponent process theory and models human vision by separating the image into three channels: luma (Y), blue-difference chroma (Cb), and red-difference chroma (Cr).
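The RGB-to-YCbCr separation described above is a fixed linear transform. A sketch using the full-range JPEG/BT.601 matrix (the exact coefficients vary by standard):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range RGB -> YCbCr using the JPEG/BT.601 coefficients.
    Y carries luma; Cb and Cr are blue- and red-difference chroma,
    offset by 128 so that zero chroma maps to the middle of [0, 255]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)

# A pure gray pixel has no chroma: Cb = Cr = 128 (the neutral point).
gray = np.array([[[128.0, 128.0, 128.0]]])
print(rgb_to_ycbcr(gray))  # [[[128. 128. 128.]]]
```

Because chroma is isolated in Cb and Cr, those two channels can be downsampled aggressively while Y is kept at full resolution.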

Chroma Subsampling

  • Chroma subsampling (downsampling the color channels) reduces the file size of an image without significantly affecting its visual quality.
  • The 4:2:0 chroma subsampling scheme, which keeps luma at full resolution but halves chroma resolution in both dimensions, is commonly used in video and image compression.
  • Artifacts can occur when downsampling images with high-frequency chroma content, such as terminal screen grabs.
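A minimal sketch of 4:2:0 subsampling on a single chroma channel: each 2x2 block of chroma samples is averaged down to one value, cutting the chroma data by 75%, and a naive nearest-neighbor upsampler reconstructs it at the decoder (real codecs use better interpolation filters).

```python
import numpy as np

def subsample_420(chroma):
    """4:2:0 chroma subsampling: average each 2x2 block,
    halving resolution in both dimensions (75% fewer samples)."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_nn(chroma_small):
    """Nearest-neighbor upsampling back to full resolution."""
    return chroma_small.repeat(2, axis=0).repeat(2, axis=1)

cb = np.arange(16, dtype=np.float64).reshape(4, 4)
small = subsample_420(cb)   # 2x2 array instead of 4x4
recon = upsample_nn(small)  # smooth chroma survives; sharp edges blur
```

On smooth chroma the round trip is nearly lossless, but on content with sharp color edges (like terminal text) the 2x2 averaging is exactly where the artifacts in the bullet above come from.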

Perceptual Distortion Metrics

  • Perceptual distortion metrics, such as SSIM, MS-SSIM, and VIF, model human visual perception more accurately than MSE.
  • Learned Perceptual Image Patch Similarity (LPIPS) is a popular method that uses machine learning models to measure distortion.
  • A hybrid approach involves combining hand-designed features with machine learning to create distortion metrics.
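As an illustration of a hand-designed perceptual metric, here is a simplified single-window SSIM. This is a sketch: the full SSIM metric averages this score over local sliding windows (often with Gaussian weighting) rather than computing one global value.

```python
import numpy as np

def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM over whole images. Compares luminance (means),
    contrast (variances), and structure (covariance) with the standard
    stabilizing constants C1 = (0.01 L)^2, C2 = (0.03 L)^2."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

img = np.arange(64, dtype=np.float64).reshape(8, 8)
print(global_ssim(img, img))          # identical images -> 1.0
print(global_ssim(img, img + 20.0))   # brightness shift lowers SSIM
```

Unlike MSE, SSIM explicitly factors quality into luminance, contrast, and structural terms, which tracks human judgments more closely.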

Rate-Distortion-Perception (RDP) Tradeoff

  • Traditional compressors and MSE as a distortion measure are not always the best for human perception.
  • RDP adds a "perception" term that compares the probability distributions of the original and reconstructed images, rather than only measuring per-image distortion.
  • Unlike traditional rate-distortion optimization, RDP aims to find the best tradeoff between rate, distortion, and perception.
  • Practical applications of RDP include learned image and video compressors that combine distortion terms like MAE with perceptual metrics like LPIPS.
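The combined training objective in the last bullet can be sketched as rate plus a weighted sum of a distortion term and a perceptual term. The function below is illustrative, not any particular codec's loss; `perceptual_fn` and the weights `lam`/`beta` are stand-ins (in practice the perceptual term would be a learned metric like LPIPS).

```python
import numpy as np

def rdp_loss(rate_bits, x, x_hat, perceptual_fn, lam=0.01, beta=1.0):
    """Illustrative RDP training objective:
        rate + lam * (distortion + beta * perception)
    where distortion is MAE and perceptual_fn stands in for a
    learned metric such as LPIPS (hypothetical signature)."""
    distortion = np.mean(np.abs(x - x_hat))       # MAE distortion term
    perception = perceptual_fn(x, x_hat)          # e.g. LPIPS in practice
    return rate_bits + lam * (distortion + beta * perception)

# Example with a dummy perceptual metric that always returns 0.
x, x_hat = np.zeros((4, 4)), np.ones((4, 4))
loss = rdp_loss(100.0, x, x_hat, lambda a, b: 0.0)
```

Sweeping `lam` traces out the rate-distortion curve; `beta` then trades distortion against how realistic (distribution-matched) the reconstructions look.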

Generative Modeling and Compression

  • Generative modeling has advanced significantly, enabling realistic samples to be drawn from complex distributions.
  • The concept of perception (matching the source distribution, the P_X term) has become central to compression, with deep neural networks like Stable Diffusion and LLMs being explored for lossless and lossy compression.
  • Recent advancements in generative modeling can be leveraged for compression, with LLMs potentially providing knowledge of the world that can be utilized for efficient compression.
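The link between generative models and lossless compression is Shannon's codelength: any probabilistic model paired with an entropy coder (e.g. arithmetic coding) spends about -log2 P(sequence) bits, so a better model of the data means shorter codes. A toy sketch with a hand-made bigram model standing in for an LLM's next-token probabilities (the probabilities here are made up for illustration):

```python
import math

# Toy conditional model P(next | prev); an LLM would supply these
# probabilities from its next-token distribution instead.
probs = {
    ("a", "b"): 0.9, ("a", "a"): 0.1,
    ("b", "a"): 0.8, ("b", "b"): 0.2,
}

def codelength_bits(seq):
    """Ideal entropy-coded length of seq under the model, in bits
    (ignoring the cost of the first symbol)."""
    return sum(-math.log2(probs[(p, c)]) for p, c in zip(seq, seq[1:]))

print(codelength_bits("abab"))  # model finds this predictable -> short code
print(codelength_bits("aaaa"))  # model finds this surprising  -> long code
```

This is why a model with strong "knowledge of the world" compresses well: sequences it predicts confidently cost very few bits.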

Beyond Traditional Modalities

  • Compression is not limited to traditional modalities like images, videos, and audio, but also extends to emerging areas such as VR, genomics, and other sensor data.
