Stanford CS149 I 2023 I Lecture 13 - Fine-Grained Synchronization and Lock-Free Programming

Deadlock

  • The lecture emphasizes the importance of understanding synchronization issues in programming, such as deadlock, livelock, and starvation. (5s)
  • Deadlock is explained using an example of cars at an intersection where none can move because each car is waiting for the next to clear the way. (20s)
  • Deadlock arises when a thread needs a resource it cannot obtain, holds onto the resources it has already acquired while waiting for the rest, and is part of a circular chain of such dependencies. (1m53s)
  • A real-life example of deadlock is provided with a photo of a bus and a car blocking each other at an intersection. (2m38s)
  • The lecture connects these concepts to computing systems, illustrating deadlock with threads that cannot proceed because they are waiting for each other to free up resources. (4m8s)
  • The four conditions necessary for deadlock in computing are mutual exclusion, hold and wait, no preemption, and circular wait; a minimal code illustration follows this list. (5m15s)
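
As a concrete illustration (not taken from the lecture), here is a minimal C++ program in which two threads acquire the same two mutexes in opposite orders, exhibiting all four deadlock conditions:

```cpp
#include <mutex>
#include <thread>

std::mutex A, B;

int main() {
    // Thread 1 holds A while waiting for B; thread 2 holds B while
    // waiting for A. Mutual exclusion, hold and wait, no preemption,
    // and circular wait are all present, so this program can hang.
    std::thread t1([] { std::lock_guard gA(A); std::lock_guard gB(B); });
    std::thread t2([] { std::lock_guard gB(B); std::lock_guard gA(A); });
    t1.join();
    t2.join();
}
```

Acquiring both mutexes through a single std::scoped_lock(A, B), which locks them in a consistent deadlock-avoiding order, breaks the circular-wait condition.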

Livelock and Starvation

  • Livelock is a situation where operations are happening but no real progress is being made. (7m46s)
  • Starvation occurs when certain operations are completely blocked because other operations are continuously taking precedence. (8m30s)

Cache Coherency

  • The cache coherence diagram represents a set of rules that all caches must follow to ensure coherence. (11m56s)
  • Each cache executes its own state machine logic to maintain coherence. (12m10s)
  • A key aspect of coherence is ensuring that when one processor writes to data, no other processor has a copy of that data. (12m34s)
  • The sequence of memory operations performed by different processors or caches must be considered to understand coherence. (12m45s)
  • When a cache loads an address, it broadcasts this action to all other caches. (13m10s)
  • All caches work together, and any state change in one cache must be communicated to all other caches. (14m3s)
  • When a cache loads an address, it announces this to other caches, indicating that it will only read from it. (14m22s)
  • If a cache wants to write to an address, it must announce this, and other caches must respond accordingly. (16m9s)
  • When another processor wants to write to the same address, the current cache must flush the most recent value to memory and drop its copy. (16m42s)
  • If a cache wants to read a value again, it must announce this, and the other processor will flush the value to memory. (17m22s)
  • The other processor can then keep the value in the shared state to avoid a cache miss if it wants to read from it again; a state-machine sketch of these rules follows this list. (17m32s)
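
The bullets above describe an invalidation-based protocol; a minimal sketch of the per-line state machine, assuming the classic MSI state names (Modified, Shared, Invalid) stand in for whatever labels the lecture's diagram uses:

```cpp
// Per-cache-line coherence state machine (MSI-style sketch).
enum class State { Modified, Shared, Invalid };
enum class Event {
    LocalRead, LocalWrite,   // this processor's loads and stores
    RemoteRead, RemoteWrite  // loads and stores announced by other caches
};

State next_state(State s, Event e) {
    switch (e) {
        case Event::LocalRead:   // loading announces intent to read
            return (s == State::Invalid) ? State::Shared : s;
        case Event::LocalWrite:  // writing requires the only copy
            return State::Modified;
        case Event::RemoteRead:  // flush dirty data, keep a shared copy
            return (s == State::Modified) ? State::Shared : s;
        case Event::RemoteWrite: // another writer: flush if dirty, drop our copy
            return State::Invalid;
    }
    return s;
}
```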

Locks and Synchronization

  • The test-and-set instruction atomically reads the old value at a memory address and sets the location to one; if the old value it returns is zero, the caller has acquired the lock. This instruction can be used to implement locks. (18m20s)
  • A lock can be represented by a variable; if the variable is zero, the lock is free, and if the variable is one, the lock is acquired. (19m15s)
  • Even when a processor has acquired a lock, other processors may still attempt to acquire the lock. These attempts, though unsuccessful, can cause cache line invalidations and movement, leading to performance issues. (23m18s)
  • A single core holding a lock for an extended period can lead to increased bus traffic as other processors spin, waiting for the lock to be released. (24m32s)
  • The test-and-set instruction is a read-modify-write that would otherwise involve multiple bus transactions under the hood; a lock prefix (as on x86) ensures atomicity by preventing intervening transactions. (25m40s)
  • A simple lock implementation using test-and-set can have performance issues as the number of processors increases, due to contention for cache lines and the time required to obtain exclusive access. (26m50s)
  • When the lock is taken by someone else, the process spins on reads to see if it is still locked. If it becomes unlocked, the process attempts a test-and-set operation; if that fails, it returns to waiting until the lock appears free again. (30m9s)
  • This test-and-test-and-set approach turns multiple writes into multiple reads, reducing global traffic during the critical section. (30m41s)
  • While one processor holds the lock, other processors spin on reading the lock's value, which remains in the shared state, causing cache hits but not global traffic. (30m51s)
  • Processor one must take a cache miss to get the line back in the exclusive state to release the lock, causing other processors to invalidate their lines and attempt to acquire the lock. (31m32s)
  • During the critical section, there are no writes, but upon lock release, there is a burst of write attempts from other processors. (32m4s)
  • This method reduces continuous traffic during the critical section, although it introduces a slight delay when the lock is released due to the need for a read and then a test-and-set operation (see the first sketch after this list). (32m51s)
  • An alternative design involves using two integers for the lock: one for the current serving thread and one for the next ticket. Threads take a ticket and spin while their ticket is not being served. (33m51s)
  • Releasing the lock involves incrementing the current serving integer, which requires exclusive access and a write operation. (34m45s)
  • This design reduces contention as only one thread attempts to acquire the lock at a time, making the lock release involve only one write operation. (35m2s)
  • This method is fair, as threads are served in the order they arrive, and acquiring the lock needs only a single atomic operation, the ticket increment, after which waiting is done with plain reads (see the second sketch after this list). (35m39s)
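
A minimal C++ sketch of the two spin-lock designs above, using std::atomic<int>::exchange to stand in for the hardware test-and-set instruction (names and layout here are illustrative, not the lecture's code):

```cpp
#include <atomic>

// Test-and-set lock: every acquisition attempt is a write, so
// waiting threads generate coherence traffic continuously.
struct TASLock {
    std::atomic<int> flag{0};
    void lock()   { while (flag.exchange(1) == 1) { /* spin */ } }
    void unlock() { flag.store(0); }
};

// Test-and-test-and-set lock: spin on a plain read (a cache hit in
// the shared state) and attempt the atomic write only once the lock
// looks free, turning most of the traffic into local reads.
struct TTASLock {
    std::atomic<int> flag{0};
    void lock() {
        while (true) {
            while (flag.load() == 1) { /* spin on read, no bus traffic */ }
            if (flag.exchange(1) == 0) return;  // try test-and-set
        }
    }
    void unlock() { flag.store(0); }
};
```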
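
And a sketch of the ticket lock described above, assuming std::atomic's fetch_add as the take-a-ticket primitive:

```cpp
#include <atomic>

// Ticket lock: one atomic fetch-and-increment to take a ticket, then
// spin on reads until "now serving" reaches that ticket. Threads are
// served in arrival order, so the lock is fair.
struct TicketLock {
    std::atomic<int> next_ticket{0};
    std::atomic<int> now_serving{0};
    void lock() {
        int my_ticket = next_ticket.fetch_add(1);  // the only atomic op
        while (now_serving.load() != my_ticket) { /* spin on read */ }
    }
    void unlock() {
        now_serving.store(now_serving.load() + 1); // single write on release
    }
};
```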

Atomic Operations

  • Atomic operations, such as atomic increment and atomic compare and swap (CAS), are provided by hardware to ensure atomicity. (36m41s)
  • CAS atomically compares the value at a memory address with a comparison value and updates it with a new value only if they match. (37m9s)
  • CAS can be used to implement more complex atomic operations, such as atomic min, by repeatedly reading the current value, computing the new value, and attempting to update the memory location using CAS until successful. (38m20s)
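
A sketch of the CAS retry loop for an atomic min, written with C++'s compare_exchange_weak (the lecture may present it with a raw CAS instruction instead):

```cpp
#include <atomic>

// Atomic min built from compare-and-swap: read the current value and
// attempt to install the smaller one; if another thread changed the
// location in between, the CAS fails and we retry.
void atomic_min(std::atomic<int>& loc, int x) {
    int old_val = loc.load();
    while (x < old_val && !loc.compare_exchange_weak(old_val, x)) {
        // on failure, compare_exchange_weak reloads old_val; retry
    }
}
```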

Load-Linked and Store-Conditional

  • Load-linked and store-conditional (LL/SC) instructions offer an alternative way to implement atomic read-modify-write operations: the conditional store succeeds only if no other processor has written to the address since the linked load. (42m48s)
  • LL/SC is often the preferred primitive on ARM processors. (43m34s)
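
C++ exposes no portable LL/SC intrinsic, but compare_exchange_weak is allowed to fail spuriously precisely so that it can compile to an LL/SC loop on ARM; a sketch of an atomic add written in that style:

```cpp
#include <atomic>

// On ARM, compare_exchange_weak typically compiles to a
// load-linked/store-conditional pair; the "weak" form may fail
// spuriously, which is why it belongs inside a retry loop.
int fetch_add_via_cas(std::atomic<int>& counter, int delta) {
    int old_val = counter.load();
    while (!counter.compare_exchange_weak(old_val, old_val + delta)) {
        // on failure, old_val is reloaded with the current value
    }
    return old_val;
}
```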

Atomic CAS Instructions

  • Atomic CAS instructions can be used to implement many other atomic operations, including lock and unlock; a minimal sketch follows. (47m0s)
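
For instance, a spin lock built directly on CAS (a sketch, not the lecture's exact code):

```cpp
#include <atomic>

// Acquire by CASing the flag from 0 (free) to 1 (held);
// release with a plain atomic store.
struct CASLock {
    std::atomic<int> flag{0};
    void lock() {
        int expected = 0;
        while (!flag.compare_exchange_weak(expected, 1)) {
            expected = 0;  // a failed CAS overwrote expected; reset it
        }
    }
    void unlock() { flag.store(0); }
};
```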

C++ Atomic Variables

  • The C++ atomic keyword ensures that operations on a variable are atomic. Atomic operations may be implemented using hardware atomic instructions or locks. (48m4s)
  • Atomic variables in C++ can be queried to determine if they are lock-free. (49m7s)
  • Relaxed memory consistency can cause issues with synchronization primitives like locks, so memory fences are necessary to ensure correct ordering of operations. (49m24s)
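
A small illustrative program covering these points, assuming a C++11-or-later compiler:

```cpp
#include <atomic>
#include <cstdio>

std::atomic<int> flag{0};   // hand-rolled lock variable
int shared_data = 0;        // data protected by the lock

void unlock() {
    // A release store keeps the critical section's writes from being
    // reordered past the unlock on weakly ordered machines; without
    // this fencing, relaxed memory consistency could expose stale data.
    shared_data = 42;
    flag.store(0, std::memory_order_release);
}

int main() {
    flag++;  // operations on std::atomic variables are atomic
    // is_lock_free() reports whether hardware atomic instructions are
    // used for this type, or whether a hidden lock is used instead.
    std::printf("atomic<int> is lock-free: %d\n", (int)flag.is_lock_free());
    unlock();
    return 0;
}
```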

Concurrent Data Structures

  • Inserting nodes concurrently can lead to losing nodes; for example, if two threads attempt to insert values 6 and 7 into the list 3, 5, 10, 11, 18, one of the insertions might be lost depending on the order of thread execution. (55m42s)
  • Concurrent insertion and deletion can also corrupt the list: for instance, if one thread deletes the node with value 10 while another thread inserts a node with value 6 before 10, the list can be left in an inconsistent state, potentially losing access to every node from 10 onward. (57m27s)
  • Using a single global lock for both insertion and deletion operations, while ensuring correctness, serializes access to the data structure, limiting parallelism and potentially hindering performance, especially if the data structure operations are performance-critical. (59m53s)
  • It is important to maintain concurrency while ensuring correctness in data structure operations. (1h2m22s)
  • Using a lock per node, a hand-over-hand locking strategy can be implemented where a lock is acquired on the next node before releasing the lock on the previous node (see the traversal sketch after this list). (1h4m49s)
  • To safely delete a node, locks should be held on both the current and previous nodes to prevent conflicts with other threads potentially modifying the list. (1h6m14s)
  • If a thread holds the lock on node 11, it prevents other threads from deleting node 18, ensuring the integrity of the linked list during operations. (1h6m59s)
  • While traversing a linked list, threads need to acquire locks on nodes to prevent issues arising from concurrent deletions or insertions by other threads. (1h8m20s)
  • Lock-free concurrent data structures, like the compare-and-swap implementations of atomic variables, provide thread safety without traditional locks by executing speculatively and checking for conflicts at the last moment (see the stack sketch after this list). (1h11m12s)
  • Lock-free implementations are used in Java's concurrent collections. They are particularly useful in highly concurrent environments like databases and web servers where multiple threads are used to manage network traffic and latency. (1h12m21s)
  • Lock-free data structures address the issue of a thread holding a lock and then being swapped out by the operating system. This situation can cause other threads to spin or wait indefinitely for the lock. (1h13m8s)
  • Lock-free algorithms, unlike blocking algorithms, can continue to make progress even when another thread is delayed or descheduled; they rely on mechanisms like compare and swap to detect and handle changes made by other threads. (1h13m36s)
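
A sketch of hand-over-hand (lock coupling) traversal over a sorted singly linked list with a std::mutex per node; the structure and names here are illustrative:

```cpp
#include <mutex>

struct Node {
    int value;
    Node* next = nullptr;
    std::mutex m;  // one lock per node
};

// Hand-over-hand traversal: hold the lock on the current node while
// acquiring the lock on the next one, so no other thread can splice
// the list apart between the two nodes we are standing on.
bool contains(Node* head, int target) {
    if (!head) return false;
    head->m.lock();
    Node* cur = head;
    while (cur->value < target && cur->next) {
        cur->next->m.lock();   // grab the next node's lock first...
        Node* prev = cur;
        cur = cur->next;
        prev->m.unlock();      // ...then release the previous one
    }
    bool found = (cur->value == target);
    cur->m.unlock();
    return found;
}
```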
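
And a sketch of the lock-free pattern itself: speculatively prepare the update, then publish it with a single CAS, retrying if another thread got there first (a Treiber-style stack push; a matching pop must additionally deal with the ABA problem):

```cpp
#include <atomic>

struct StackNode {
    int value;
    StackNode* next;
};

// Push never blocks: a thread that is descheduled mid-operation
// cannot prevent other threads from completing their pushes.
void push(std::atomic<StackNode*>& top, int v) {
    StackNode* n = new StackNode{v, top.load()};  // speculative link
    while (!top.compare_exchange_weak(n->next, n)) {
        // a failed CAS stored the current top into n->next; retry
    }
}
```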
