Stanford CS149 | Parallel Computing | 2023 | Lecture 14 - Midterm Review
25 Sep 2024
Course Themes
- (5s) The course covers broad themes, including understanding multicore architecture, which is emphasized in lectures one, two, and three.
- (27s) Another major theme is optimizing parallel programs, focusing on identifying workload imbalances and improving cache performance.
- (1m6s) GPU architecture is discussed, highlighting the similarities with multicore architecture and the basics of the CUDA programming model.
- (1m33s) The lecture on deep neural networks emphasizes the importance of cache locality and specialized hardware, which will be covered more in the final exam.
- (2m8s) Data parallel thinking is another topic, with exercises to reinforce understanding.
Synchronization and Cache Coherence
- (2m27s) Fine-grain locking, compare and swap, and their relationship to cache coherence are discussed, with practice problems provided.
- (3m17s) The MSI protocol and relaxed memory consistency are also covered, explaining how different processors might see writes and reads in different orders.
- (4m15s) Compare and swap is explained as an atomic operation used to build synchronization primitives, such as locks that ensure mutual exclusion over a critical section.
- (6m32s) Atomic Min operations ensure mutual exclusion, preventing other threads from accessing a variable while a Min operation is being performed on it.
- (7m48s) Atomic compare and swap operations do not ensure mutual exclusion on the entire Min operation.
- (8m48s) Atomic compare and swap only guarantees detection: if the value in memory has changed since it was first read, the min computation is redone (see the CAS-based sketch after this list).
- (12m39s) Starvation occurs when a system is making progress, but one thread never succeeds.
- (14m40s) Implementing atomic CAS requires cache coherence, and even with it, each attempt involves a BusRdX (read-for-exclusive, effectively a write transaction), which is expensive.
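To make the CAS discussion above concrete, here is a minimal CUDA sketch of an atomic min built from `atomicCAS`. This is not the lecture's code; `atomic_min_via_cas` and `min_kernel` are names chosen here, and CUDA already provides `atomicMin` for integers. The point is only to show that CAS gives no mutual exclusion over the whole min operation, just detection of a changed value followed by a retry.

```cuda
#include <cstdio>

// Illustrative only: CUDA already has atomicMin for ints. This shows the
// CAS retry pattern: read, decide, try to install, redo if memory changed.
__device__ void atomic_min_via_cas(int* addr, int value) {
    int old = *addr;                            // initial read: no lock is held
    while (old > value) {
        // Install `value` only if *addr still equals `old`, the value the
        // comparison was based on. There is no lock around the whole min.
        int prev = atomicCAS(addr, old, value);
        if (prev == old) break;                 // our minimum won
        old = prev;                             // memory changed underneath us: redo the min
    }
}

__global__ void min_kernel(int* result, const int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomic_min_via_cas(result, data[i]);
}

int main() {
    const int n = 4;
    int h_data[n] = {7, 3, 9, 5};
    int h_result = 1 << 30;

    int *d_data, *d_result;
    cudaMalloc((void**)&d_data, n * sizeof(int));
    cudaMalloc((void**)&d_result, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);

    min_kernel<<<1, 32>>>(d_result, d_data, n);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("min = %d\n", h_result);        // expect 3

    cudaFree(d_data);
    cudaFree(d_result);
}
```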
MapReduce
- (18m28s) MapReduce, inspired by the data-parallel map and reduce primitives, is a programming model and system developed at Google for processing massive datasets distributed across many machines: "map" iterates over lines of input files and "reduce" aggregates the results.
- (18m45s) In MapReduce, the map function is applied to each line of the input file, producing key-value pairs.
- (20m22s) The system then groups all values with the same key into a file, effectively shuffling and reorganizing the data.
- (21m18s) The reduce function then processes each key and its associated list of values, producing the final output (see the word-count sketch after this list).
- (25m41s) Latency can be hidden with massive parallelism, but bandwidth is a significant concern, especially when dealing with terabytes of data being read from and written to disk between computational phases.
- (27m9s) While platforms like Spark and MapReduce offer parallelism and fault tolerance, their reliance on disk I/O can lead to performance bottlenecks, making a single-core laptop with sufficient memory outperform a large cluster for certain tasks.
- (29m29s) Amazon Prime Video, initially built on a microservices architecture using AWS Lambda for scalability, transitioned back to a smaller number of powerful servers to prioritize data locality and reduce costs without sacrificing performance.
- (29m50s) Scaling out is essential when working with large datasets that cannot be stored in memory on a single node.
- (30m0s) Systems designed for large datasets are valuable because they allow for distributed storage and processing.
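As a rough sketch of the programming model described above, here is a single-process word count written in MapReduce style. The helper names (`map_line`, `reduce_counts`) are hypothetical; a real system would run the map and reduce phases in parallel across many machines and shuffle the intermediate key-value files between them.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// "map": one input line -> a list of (key, value) pairs.
std::vector<std::pair<std::string, int>> map_line(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.push_back({w, 1});   // emit (word, 1)
    return out;
}

// "reduce": one key plus all of its values -> a single aggregated result.
int reduce_counts(const std::string& /*key*/, const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> lines = {"the quick brown fox", "the lazy dog"};

    // Shuffle phase: group every emitted value by its key.
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : lines)
        for (const auto& kv : map_line(line))
            groups[kv.first].push_back(kv.second);

    // Reduce phase: aggregate each key's values independently (and, in a real
    // system, in parallel across machines).
    for (const auto& g : groups)
        std::cout << g.first << " " << reduce_counts(g.first, g.second) << "\n";
}
```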
MSI Cache Coherence Protocol
- (30m25s) The discussion shifts to the MSI (Modified, Shared, Invalid) cache coherence protocol.
- (30m46s) Exams are open notes, and diagrams of the MSI protocol are provided, so memorization is not necessary.
- (31m2s) Understanding state transitions in the MSI protocol is crucial for answering related questions.
- (31m20s) In the default MSI protocol, a cache read miss issues a bus read (BusRd) transaction to fetch the line.
- (31m31s) If an address is not in the cache, it is in the invalid state.
- (31m53s) A read moves the line from invalid to the shared state; a write issues a read-exclusive (BusRdX), invalidating any other copies, and moves the line to the modified state (modeled in the sketch at the end of this section).
- (32m14s) Reading and then writing an address is therefore a two-step transition: invalid → shared (BusRd), then shared → modified (BusRdX).
- (32m22s) Atomic operations like CAS (Compare-And-Swap) must be handled as a single write-bus transaction to maintain atomicity.
- (33m0s) Atomic operations require special logic to ensure they are executed correctly within the MSI protocol.
- (33m30s) The sequence of operations in the MSI protocol is reviewed, with a practical demonstration involving two participants acting as caches.
- (34m31s) Participants simulate cache operations, writing the state of their cache lines on the board to illustrate the protocol.
- (36m35s) Cache zero, initially in the shared state, transitions to the modified state after writing a value, guaranteeing no other cache holds the value.
- (37m24s) Cache zero updates the value again without coherence traffic, as it maintains exclusive access.
- (38m7s) Cache one, attempting to store a value, triggers a flush from cache zero, updating memory and invalidating cache zero's line.
- (40m52s) Cache zero, reading the value, transitions to the shared state, reflecting the potential for other caches to hold the value.
- (41m19s) The discussion turns to shared data structures and cache coherence: how can a cache tell whether another cache may hold a copy of some data?
- (41m31s) Sharing a data structure between threads gives no guarantee about whether copies of its data are present in other caches.
- (41m41s) Before updating a variable, exclusive access must be obtained, which involves notifying other caches to invalidate their copies.
- (42m1s) If a cache updates a value without notifying others, it may lead to outdated reads by other processors.
- (42m14s) To get exclusive access, a cache must declare its intent to write, prompting other caches to invalidate their copies.
- (42m40s) The MSI protocol is discussed, highlighting the difference between correct and efficient actions regarding memory writes.
- (42m54s) When another processor later reads the address, the cache holding the line in the modified state must flush it so the reader observes the updated value.
- (43m10s) The state of the cache line (e.g., dirty) affects whether a flush is needed.
- (43m42s) Coherence is maintained for each address separately, with state machines managing different addresses.
- (44m30s) When a processor writes to a variable, it must notify the other caches, which then invalidate their copies and flush the data if it is dirty.
- (44m51s) Any state change that grants a local cache write permission requires notifying the other caches to maintain coherence.
- (45m21s) Notifications are typically implemented as point-to-point messages rather than broadcasts.
- (45m31s) A question is raised about why data needs to be flushed back to memory rather than being directly transferred between caches.
- (46m16s) In more complex cache coherence protocols, such as five-state extensions of MESI (e.g., MOESI), a cache with an up-to-date copy can supply the data directly to another cache.
- (46m41s) The discussion clarifies that cache lines contain multiple values, and writing back to memory ensures all data in the cache line is updated.
- (47m16s) The entire cache line must be sent because it is unclear which parts are dirty, requiring a more advanced cache to track dirtiness at the bit or byte level.
- (47m45s) Real CPUs have protocol states that allow a cache line to be transferred to another cache without first flushing to memory, depending on whether the line is in a shared or modified state.
- (48m53s) In the MESI protocol, the exclusive state allows a processor to write to a cache line without notifying others if no other processor has the data.
- (50m53s) If a processor in the exclusive state detects a bus read, it transitions to the shared state.
- (51m22s) Elevating from the shared state to the exclusive state would require all other processors to invalidate their copies, which complicates the protocol.
- (51m57s) A more advanced cache coherence protocol could allow a processor to notify others when it goes to the invalid state, enabling elevation to the exclusive state if only one processor has the data.
- (52m41s) Different cache lines do not interact with each other in terms of coherence, but bringing a new cache line into the cache could cause the eviction of an existing one, leading to invalidation.
- (53m54s) In steady state, caches are full, so reading a new value requires evicting an existing line and flushing it first if it is dirty.
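The state transitions walked through in this section can be summarized as a toy state machine for one cache line. This is a simplified model, assuming a snooping bus with BusRd/BusRdX transactions as in the lecture's MSI diagram; the names and the replayed classroom-demo sequence are illustrative, not the lecture's exact trace.

```cpp
#include <cstdio>

enum class State { Invalid, Shared, Modified };
enum class Event { LocalRead, LocalWrite, SnoopBusRd, SnoopBusRdX };

// One cache's MSI transitions for a single line; issues_flush reports whether
// this cache must write the dirty line back as part of the transition.
State step(State s, Event e, bool* issues_flush) {
    *issues_flush = false;
    switch (s) {
        case State::Invalid:
            if (e == Event::LocalRead)  return State::Shared;    // issue BusRd
            if (e == Event::LocalWrite) return State::Modified;  // issue BusRdX
            return State::Invalid;                               // snooped traffic: nothing to do
        case State::Shared:
            if (e == Event::LocalWrite)  return State::Modified; // issue BusRdX (upgrade)
            if (e == Event::SnoopBusRdX) return State::Invalid;  // another cache wants to write
            return State::Shared;                                // reads (local or snooped): stay
        case State::Modified:
            if (e == Event::SnoopBusRd)  { *issues_flush = true; return State::Shared; }
            if (e == Event::SnoopBusRdX) { *issues_flush = true; return State::Invalid; }
            return State::Modified;                              // local reads/writes hit silently
    }
    return s;
}

int main() {
    // Replay the classroom demo: cache 0 writes, writes again, cache 1 writes,
    // then cache 0 reads the value back.
    bool flush;
    State c0 = State::Shared, c1 = State::Invalid;
    c0 = step(c0, Event::LocalWrite, &flush);   // c0: S -> M (BusRdX)
    c0 = step(c0, Event::LocalWrite, &flush);   // c0: M -> M, no bus traffic
    c1 = step(c1, Event::LocalWrite, &flush);   // c1: I -> M (BusRdX) ...
    c0 = step(c0, Event::SnoopBusRdX, &flush);  // ... c0: M -> I, flush to memory
    c0 = step(c0, Event::LocalRead, &flush);    // c0: I -> S (BusRd) ...
    c1 = step(c1, Event::SnoopBusRd, &flush);   // ... c1: M -> S, flush
    std::printf("c0=%d c1=%d last_flush=%d\n", (int)c0, (int)c1, (int)flush);
}
```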
Relaxed Consistency
- (58m2s) Relaxed consistency, unlike coherence (which concerns operations on the same address), deals with the order of effects from operations on different addresses.
- (58m31s) While a single thread of control maintains program order for its own instructions, relaxed consistency allows other threads to observe updates to different addresses out of order (e.g., T1 might see the update to Y before the update to X, even though T0 wrote X first; see the sketch after this list).
- (59m45s) The discussion emphasizes the importance of understanding the observation of another processor's writes and maintaining consistency with C semantics.
- (1h0m10s) Instructions may be reordered for various reasons, but the issuing thread always observes its own operations as if they executed in program order; other threads may not.
- (1h0m59s) Creating questions that involve more than write-write reordering is challenging, and write-write reordering is a significant concept to understand.
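A minimal C++ sketch of the two-thread example above, assuming relaxed atomics (the variable names are illustrative). Under relaxed memory orders, T1 is allowed to observe the write to `y` before the write to `x`, even though T0 issued them in the opposite program order.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

void t0() {
    x.store(1, std::memory_order_relaxed);  // W x
    y.store(1, std::memory_order_relaxed);  // W y  (may become visible before x)
}

void t1() {
    int ry = y.load(std::memory_order_relaxed);  // R y
    int rx = x.load(std::memory_order_relaxed);  // R x
    if (ry == 1 && rx == 0)
        std::printf("observed y's update before x's\n");  // legal under relaxed consistency
}

int main() {
    std::thread a(t0), b(t1);
    a.join();
    b.join();
    // Making t0's second store a release and t1's first load an acquire would
    // rule the surprising outcome out.
}
```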
Segmented Scan
- (1h1m51s) Segmented scan is defined as performing a scan over a sequence of sequences, and the algorithm presented computes these scans in parallel with the same O(n) work as an ordinary scan (see the sketch after this list).
- (1h3m5s) The segmented scan algorithm is explained, and it is noted that understanding its definition is crucial for higher-level algorithm development.
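To pin down the definition, here is a sequential reference sketch of segmented inclusive scan over a flat data array plus a flags array, where a flag of 1 marks the start of a new segment (a common representation; the lecture may use a different one). A parallel implementation follows the same recurrence while carrying the segment flags through the scan tree.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Sequential reference: inclusive scan restarted at every segment boundary.
std::vector<int> segmented_scan(const std::vector<int>& data,
                                const std::vector<int>& flags) {
    std::vector<int> out(data.size());
    int running = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (flags[i]) running = 0;      // a new segment resets the accumulator
        running += data[i];
        out[i] = running;               // inclusive scan within the segment
    }
    return out;
}

int main() {
    // Two segments: [1 2 3] and [4 5]  ->  [1 3 6] and [4 9]
    std::vector<int> data  = {1, 2, 3, 4, 5};
    std::vector<int> flags = {1, 0, 0, 1, 0};
    for (int v : segmented_scan(data, flags)) std::printf("%d ", v);
    std::printf("\n");
}
```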
CUDA and GPU Architecture
- (1h3m59s) CUDA is described as a bulk launch system similar to a tasking API, where tasks are run in thread blocks with a specified number of threads.
- (1h5m24s) An example CUDA program with 128 threads per block, each block requiring over 512 bytes of storage, is given, highlighting the resource requirements of thread blocks on a GPU (see the kernel sketch at the end of this section).
- (1h6m33s) The number of thread blocks that can run concurrently on a processor is limited by the shared memory resources. If thread blocks require less shared memory, more thread blocks can fit and run concurrently.
- (1h8m14s) The Nvidia scheduler assigns thread blocks to available cores. To maximize performance, the scheduler distributes thread blocks across different cores to avoid making the execution inherently sequential within a single core.
- (1h10m27s) A warp consists of 32 CUDA threads that are executed together on a group of 32 SIMD ALUs. The scheduler selects a warp and runs one instruction for each of the 32 threads in the warp.
- (1h12m4s) Changing the ISPC gang size to 64 requires recompiling the code; the compiler then generates a single thread that uses 64-wide vector instructions.
- (1h12m15s) A key distinction between GPU SIMD and CPU SIMD lies in where vectorization happens. For CPU SIMD, the compiler generates explicit vector instructions for the target machine; for the GPU, the compiler mostly generates scalar per-thread instructions, and the hardware groups threads for SIMD execution.
- (1h12m38s) NVIDIA's hardware determines how the threads within a thread block execute, historically dividing them into consecutive groups of 32 threads (warps) by thread index for parallel processing.
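A minimal CUDA sketch of the kind of bulk launch described in this section, assuming a made-up `scale` kernel: 128 threads per block, a 512-byte `__shared__` array per block (one of the resources that limits how many blocks fit on an SM at once, in the same ballpark as the lecture's figure), and a grid sized to cover the input.

```cuda
#include <cstdio>

#define THREADS_PER_BLOCK 128

__global__ void scale(const float* in, float* out, int n) {
    // Per-block storage: 128 floats = 512 bytes of shared memory.
    __shared__ float staging[THREADS_PER_BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global CUDA thread id
    if (i < n) staging[threadIdx.x] = in[i];
    __syncthreads();                                 // all 128 threads (4 warps) reach here

    if (i < n) out[i] = 2.0f * staging[threadIdx.x];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    // Bulk launch: enough 128-thread blocks to cover n elements. The hardware
    // scheduler assigns blocks to SMs and splits each block into warps of 32
    // consecutive threads for SIMD execution.
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    scale<<<blocks, THREADS_PER_BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```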