Stanford CS149 | 2023 | Lecture 3 - Multi-core Arch Part II + ISPC Programming Abstractions
14 Sep 2024
Hardware Multi-threading
- Hardware multi-threading improves processor utilization by switching to another thread when the current thread stalls, for example, while waiting for a long memory request. (3m7s)
- Hardware multi-threading allows a single core processor to maintain state for and switch between multiple instruction streams, increasing processor utilization by reducing idle time when a thread stalls. (6m41s)
- Hardware multi-threading incurs costs, including increased chip space for execution contexts and potentially slower completion times for individual threads due to resource sharing. (5m17s)
Processor Utilization and Thread Count
- Running a program with a sequence of three math operations followed by a 12-cycle memory stall results in 20% processor utilization with a single thread. (10m7s)
- Achieving 100% processor utilization for the same program requires five threads: while one thread stalls for 12 cycles, the other four threads each supply three cycles of work, exactly covering the stall (worked arithmetic after this list). (11m57s)
- To achieve 100% utilization, the ratio of math work to stall latency is what matters: the more math a program does per memory stall, the fewer threads are needed to reach full utilization. (14m20s)
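A worked restatement of the numbers above, using $W$ for math cycles per thread and $L$ for stall cycles (notation introduced here, not in the lecture):

$$\text{utilization}(T)=\min\!\left(1,\ \frac{T\cdot W}{W+L}\right),\qquad T_{\text{full}}=1+\frac{L}{W}$$

With $W=3$ and $L=12$: one thread gives $3/15=20\%$, and $T_{\text{full}}=1+12/3=5$ threads, since the four waiting threads together supply $4\times 3=12$ cycles of work to cover the 12-cycle stall.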
Cache and Thread Count
- A large data cache can reduce the number of cache misses, effectively decreasing memory access time and reducing the need for multiple threads. Conversely, removing the data cache leads to more misses and necessitates more threads for optimal performance. (14m50s)
Parallelism and Execution Units
- A processor with 16 cores, each with four-way threading and an eight-wide vector unit, might appear to need only 128 parallel operations (one per vector lane). To fully hide latency and achieve maximum performance, however, it requires 512 independent pieces of work, enough to fill every thread context on every lane (products below). (18m10s)
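The products behind those two numbers:

$$\text{lanes}=16\ \text{cores}\times 8\ \text{lanes/core}=128,\qquad \text{work items}=16\ \text{cores}\times 4\ \text{threads}\times 8\ \text{lanes}=512$$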
Myth Machine
- A Myth machine is a multi-threaded machine with two hardware threads per core and is superscalar, meaning it can execute multiple instructions per clock inside a core. (18m24s)
- A Myth machine has at least three 8-wide vector ALUs and can issue up to three vector operations per clock, drawn from the two threads. (19m19s)
- A performance benefit can be observed when running two threads on a Myth core compared to one thread because a single thread might not provide enough instruction-level parallelism to saturate all execution units. (20m50s)
Instruction Execution and Multi-threading
- An execution unit in a CPU core can only process one instruction at a time, even with techniques like simultaneous multi-threading. (24m29s)
- Intel CPUs use multiple fetch/decode units to find independent instructions across multiple threads, maximizing instruction throughput. (24m44s)
Modern CPUs and GPUs
- Modern Intel CPUs and Nvidia GPUs achieve high performance through massive parallelism, with Intel cores featuring multiple execution units and Nvidia GPUs having many cores with wide vector units and support for numerous threads. (27m7s)
Superscalar Processing
- A superscalar processor can execute a scalar and a vector instruction simultaneously if the instructions are independent. (30m29s)
Modern Intel Chip Architecture
- The lecture's running example resembles a modern Intel chip: a processor with two cores, each with four-way multithreading and the capability to execute a combination of two instructions per clock cycle. (33m21s)
GPU Vector Processing
- Nvidia and AMD GPUs don't use vector instructions; they issue scalar instructions and execute groups of threads that share a program counter in lockstep, one thread per ALU lane (eight threads in the lecture's example), achieving the same effect as vector processing. (34m10s)
Thread Scheduling
- The operating system (OS) determines which thread runs on which execution context. (35m21s)
- If two threads need to share data or a cache, it might make sense to schedule them on the same core. (36m9s)
- The operating system decides which threads to run on which execution contexts, but the chip decides which instructions to select for execution. (36m41s)
Parallelism and Vectorization
- An element-wise addition or multiplication of two large arrays is a good application to run on modern computers because it has a high degree of parallelism and is vectorizable. (39m12s)
- The Nvidia V100 GPU has 5,000 multipliers and runs at 1.6 GHz, requiring a lot of data to keep it busy. (40m21s)
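A rough check of scale using the lecture's round numbers:

$$5{,}000\ \text{multipliers}\times 1.6\times 10^{9}\ \text{clocks/s}\approx 8\times 10^{12}\ \text{multiplies/s}$$

so an element-wise kernel has to deliver operands at a rate of roughly $10^{13}$ values per second to keep the chip busy.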
Latency and Throughput
- Latency is the amount of time it takes for something to travel from a starting point to an end point. For example, if it takes 30 minutes to drive from San Francisco to Stanford, then the latency is 30 minutes. (41m47s)
- Throughput is the rate at which something is processed. For example, if one car can travel on a road to Stanford every 30 minutes, the throughput is two cars per hour. (42m25s)
- Latency and throughput can be decoupled. For example, if a system can transfer two items at a time from memory to a processor, the latency will remain the same, but the bandwidth will double. (47m0s)
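In steady state the two quantities are tied together by the number of items in flight (a standard identity; the notation here is not from the lecture):

$$\text{throughput}=\frac{\text{items in flight}}{\text{latency}}$$

With a 30-minute drive and one car on the road at a time, throughput is 2 cars/hour; with two cars in flight at once, latency is still 30 minutes but throughput doubles to 4 cars/hour.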
Throughput and Bottlenecks
- It takes 2 hours to do one load of laundry if you factor in the time it takes to wash (45 minutes), dry (1 hour), and fold (15 minutes) the clothes. (47m57s)
- If you only have one washer and one dryer, the throughput of laundry is one load per hour, even if you start a new load in the washer as soon as it finishes the previous load. (51m23s)
- The throughput is limited by the slowest part of the process, which in this case is the dryer. (51m40s)
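In pipeline terms, steady-state throughput is set by the slowest stage while latency is the sum of all stages:

$$\text{throughput}=\frac{1}{\max_i t_i}=\frac{1}{60\ \text{min (dry)}}=1\ \text{load/hour},\qquad \text{latency}=45+60+15=120\ \text{min}$$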
Memory Bandwidth Bottleneck
- A simple program that loads 64 bytes and performs additions on them, running on a processor that completes one math operation per clock, is used as an example to illustrate memory bandwidth limiting performance. (53m36s)
- The computer is modeled as a two-way superscalar processor that can fetch eight bytes per clock from memory. (54m8s)
- The memory system is modeled as having a latency of eight cycles to fetch a cache line. (55m5s)
- The processor is limited to having three outstanding loads to memory at any given time. (55m50s)
- Due to the memory bandwidth bottleneck, the processor stalls frequently, waiting for data to arrive from memory. (56m49s)
- A modern Nvidia memory system with 900 GB/s of bandwidth is used as an example to illustrate the scale of the problem. (57m15s)
- The example program would require roughly 100 TB/s of bandwidth to keep the ALUs fed with data, highlighting the significant disparity between computation and data transfer rates (worked arithmetic after this list). (57m23s)
- Running such a program on a modern GPU or CPU would result in approximately 1% efficiency due to memory bandwidth limitations. (58m29s)
- Even with perfect prefetching, memory bandwidth can still bottleneck a program's performance if it tries to extract data faster than the memory system can provide it. (59m39s)
- Most programs are bottlenecked by memory bandwidth, regardless of parallelism, unless their arithmetic intensity (the ratio of math operations to memory accesses) is increased. (1h0m16s)
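The arithmetic behind the roughly 1% figure, assuming (as an element-wise kernel implies, though the lecture's exact byte counts may differ) two 4-byte loads and one 4-byte store per math operation:

$$5{,}000\ \text{ALUs}\times 1.6\times 10^{9}\ \text{ops/s}\times 12\ \text{bytes/op}\approx 10^{14}\ \text{bytes/s}\approx 100\ \text{TB/s required}$$

$$\text{efficiency}\approx \frac{900\ \text{GB/s available}}{100\ \text{TB/s required}}\approx 1\%$$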
ISPC Programming Language
- ISPC is a programming language that makes parallel computing concepts clear due to its low-level nature, allowing developers to understand what each core and thread is doing at any given time. (1h4m40s)
- When a C program calls an ISPC function, instead of a single thread executing the function, ISPC spawns a "gang" of program instances that all execute it. (1h7m18s)
- Each program instance has a unique `programIndex`, and all instances share the same `programCount`, which gives the total number of instances in the gang. (1h7m40s)
- This programming model, known as SPMD (Single Program, Multiple Data), lets each instance execute the same code while potentially operating on different data based on its `programIndex`. (1h9m10s)
- There are multiple ways to divide work among program instances in ISPC, including interleaving elements of an array across instances and assigning contiguous blocks of an array to each instance (see the sketch after this list). (1h11m53s)
- While the speaker provides examples of manually dividing work, they emphasize that programmers can use the `foreach` construct to instruct ISPC to automatically assign iterations of a loop to program instances. (1h15m4s)
- The speaker encourages viewers to consider the different ways ISPC could implement `foreach` and to understand that multiple valid implementations exist. (1h15m46s)
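A minimal ISPC sketch of these two ideas (not from the lecture; the function names and the scale-by-constant computation are invented for illustration):

```ispc
// Manual interleaved assignment: within each block of programCount
// elements, instance programIndex handles element base + programIndex.
export void scale_interleaved(uniform int N, uniform float s,
                              uniform float x[]) {
    for (uniform int base = 0; base < N; base += programCount) {
        int i = base + programIndex;  // varying: one index per instance
        if (i < N)                    // mask off instances past the end
            x[i] = s * x[i];
    }
}

// Same computation with foreach: ISPC decides how to assign loop
// iterations to the program instances of the gang.
export void scale_foreach(uniform int N, uniform float s,
                          uniform float x[]) {
    foreach (i = 0 ... N) {
        x[i] = s * x[i];
    }
}
```

In the interleaved version, instance `i` touches elements `i`, `i + programCount`, `i + 2*programCount`, and so on, so within each iteration the gang's loads are contiguous in memory; a blocked assignment would instead give each instance a contiguous chunk of roughly `N / programCount` elements, and `foreach` leaves the choice to the compiler.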