Stanford CS149 I Parallel Computing I 2023 I Lecture 11 - Cache Coherence
22 Sep 2024 (15 days ago)
Spark: Distributed Computing Framework
- Spark aims to enable efficient distributed computing for applications that heavily utilize intermediate data, such as iterative algorithms and queries. (1m33s)
- Resilient Distributed Datasets (RDDs) are the key abstraction in Spark, representing read-only, ordered collections of records. (2m56s)
- Spark optimizes for locality through techniques like fusion and tiling to enhance parallel performance. (4m1s)
- Transformations can have wide dependencies that require communication between nodes in a system, for example, a group by key operation. (7m41s)
- If data is partitioned using the same hash partitioner, the runtime system can detect narrow dependencies and perform fusion, reducing communication overhead. (11m2s)
- Spark maintains fault tolerance through lineage, which is a log of transformations applied to RDDs, allowing for data recreation from persistent storage. (12m11s)
- If a node crashes during computation, the system recovers by rerunning the log of operations to recreate lost data partitions. This leverages data replication in HDFS and the coarse-grained nature of the operation log. (15m45s)
- Spark offers significant performance improvements, often orders of magnitude faster than Hadoop, due to its use of in-memory processing and reduced reliance on disk access for iterative tasks. (18m30s)
- Spark's ecosystem includes domain-specific frameworks like Spark SQL for database processing, Spark MLlib for machine learning, and Spark GraphX for graph operations, all built upon the underlying RDD abstraction. (20m30s)
- Spark can be extended beyond transformations and actions to perform tasks like graph analysis and database operations. (23m26s)
- If data fits in memory, using a distributed system like Spark is not recommended due to overheads, as a single system will be more efficient. (26m27s)
Cache Memory: Exploiting Locality
- Modern processors dedicate a significant portion of their chip area to cache (30% or more) to enhance performance by exploiting locality. (28m31s)
- Cache lines are used to exploit spatial locality in program behavior, which is the tendency to access memory locations that are close together. (29m58s)
- The Intel Skylake chip, found in the myth machines, has a 32 kilobyte L1 data cache that is 8-way set associative, meaning that any given cache line can only be stored in one of eight places in the cache. (34m59s)
- Conflict misses occur when the limited associativity of a cache prevents a cache line from being stored in a particular location, even if there are other available locations in the cache. (36m33s)
- Set-associative caches are organized into sets, which can be visualized as buckets, with each set containing a fixed number of cache lines. (37m26s)
- As cache size increases, the set associativity does not always increase proportionally. (38m40s)
- A cache line consists of data and metadata. The metadata includes a tag, which stores the memory address of the data, and a dirty bit, indicating whether the data in the cache has been modified. (39m31s)
- In a write-back cache, data is written to the cache first and only later written to the main memory. In contrast, a write-through cache writes data to both the cache and main memory simultaneously. (41m31s)
- Write-allocate, used with write-back caches, fetches the entire cache line from main memory upon a write miss and then writes the data. (43m39s)
- Right back caches write data to the cache and set a dirty bit, while right through caches write data to both the cache and memory simultaneously, eliminating the need for a dirty bit. (44m40s)
- The tag in a cache represents the address of a cache line, not just a single variable, allowing for spatial locality. (49m25s)
- Higher set associativity in a cache reduces miss rates but increases the complexity of cache lookups. (50m35s)
Cache Coherence: Maintaining Data Consistency
- When multiple processors with individual caches are connected to a main memory, a memory coherence problem can occur. (52m52s)
- If multiple processors read and write to the same memory address, inconsistencies can arise because updates to one cache might not be reflected in other caches or the main memory. (56m1s)
- Determining the "last" value written to a memory address becomes complex in a multi-processor system because simultaneous writes from different processors need a clear order to ensure data consistency. (59m0s)
- For any given memory location, access should be serialized to maintain coherence. This means that reads and writes to a specific address should happen in the order issued by the processor, ensuring subsequent reads reflect the last write to that location. (1h0m51s)
- Two key invariants are crucial for cache coherence: the single writer multiple reader invariant, which allows only one processor to modify a cache line at a time while permitting multiple readers, and the data value invariant, which ensures that all readers see the last written value during a read-only epoch. (1h2m1s)
- While software-based solutions using virtual memory pages are possible, they are slow and prone to issues like false sharing. Hardware-based solutions operating at the cache line level, such as snooping and directory-based systems, offer a more efficient approach to maintaining cache coherence. (1h6m47s)
- A single shared cache for multiple processors can lead to a performance bottleneck due to bandwidth limitations and interference between processors. (1h7m52s)
- Shared caches can experience both constructive interference, such as pre-fetching data for subsequent iterations, and destructive interference, where one processor's cache misses are caused by another processor. (1h8m46s)
Snooping Cache Coherence: Using a Bus
- Snooping cache coherence schemes use an interconnect, such as a bus, to connect processors and their caches, allowing them to communicate and maintain data consistency. (1h11m2s)
- Buses have two important properties for cache coherence: they allow only one transaction at a time, providing serialization, and they broadcast all transactions to all connected devices. (1h14m12s)
- A bus in computer architecture functions as a broadcast medium, where all processors receive communication from a single source. (1h15m0s)
- The MSI protocol, used in cache coherence, utilizes three states: Modified, Shared, and Invalid. (1h18m27s)
- The MSI protocol ensures exclusive access for write operations by using a combination of processor operations (processor read, processor write) and bus transactions (bus read, bus read exclusive, bus write back). (1h19m14s)