Stanford CS149 I Parallel Computing I 2023 I Lecture 1 - Why Parallelism? Why Efficiency?

Course Information

  • Kayvon and Kunle are the instructors for the class, CS149: Parallel Computing. (21s)
  • The class has a diverse group of students, including hardware architects, software programmers, machine learning specialists, and graphics enthusiasts. (1m48s)
  • The class will cover the fundamentals of parallel computing, focusing on both parallelism and efficiency, with an emphasis on understanding how to achieve optimal performance in various applications. (4m56s)
  • Students are required to make approximately two comments per lecture throughout the quarter. (36m34s)
  • 8 late days are provided to students and can be used for programming or written assignments. (37m55s)
  • The final exam for the course will be held in person. (40m2s)
  • The course will cover programming mechanisms for parallel processing, hardware fundamentals, and will include a stronger hardware design component, potentially involving an extra credit assignment using FPGAs. (30m49s)

Parallelism and Efficiency

  • Efficiency is a crucial aspect of parallel programming, sometimes even more important than achieving high levels of parallelism. (31m19s)
  • While parallel processing can significantly enhance performance, there are instances where simpler, more efficient solutions might be preferable. (31m8s)
  • Hardware designers prioritize efficiency to minimize costs, aiming to achieve desired performance levels with the least amount of hardware. (33m20s)

Parallel Processing Examples

  • Tina was able to add 16 numbers in approximately 40 seconds. (7m49s)
  • A second attempt at adding 16 numbers was made with two participants. (8m52s)
  • Each participant added their half of the numbers in approximately 20 seconds, but the total task still took approximately 40 seconds because of the time needed to communicate and combine the partial results. (12m12s)
  • With four volunteers, the first attempt at parallelizing the task of adding numbers failed because the workload was not evenly distributed. One person had significantly more work, causing others to wait idly. (15m37s)
  • In the second attempt, the four volunteers pooled all the numbers together and each person took the next available number to add, ensuring a more balanced workload distribution (see the sketch after this list). This strategy resulted in a completion time of 19 seconds. (17m40s)
  • Although the second attempt was faster, it could have been further optimized by having pairs of volunteers add their sums together before the final addition, potentially saving a few seconds from the total time. (18m41s)
  • There were multiple strategies discussed for efficiently counting the number of people in a classroom. One strategy involved dividing the class into groups and assigning individuals to sum the counts from smaller groups. (19m31s)
  • Another strategy involved dividing the classroom into sections (front, middle, back) and having each section count themselves, with the counts propagating towards the middle. (22m51s)
  • A final strategy involved having everyone in the class get into groups of 10 and then counting the number of groups. (24m37s)
  • A class exercise was conducted where students were asked to find the total number of students in attendance as quickly as possible. (25m0s)
  • The exercise highlighted the importance of communication and data movement in parallel processing, as the time taken to complete the task was significantly longer than a simple sequential count due to these factors. (28m50s)
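
A minimal C++ sketch of the dynamic-assignment strategy from the second four-volunteer attempt. The worker count, the placeholder values, and the atomic-counter protocol are illustrative assumptions, not details from the lecture: each worker repeatedly claims the next unclaimed number via a shared atomic index, accumulates a private partial sum, and the partial sums are combined at the end; that final combine is the communication step the exercise showed can dominate total time.

```cpp
#include <atomic>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // The 16 numbers to add (values are arbitrary placeholders).
    std::vector<int> numbers(16);
    std::iota(numbers.begin(), numbers.end(), 1);   // 1, 2, ..., 16

    const int num_workers = 4;
    std::atomic<size_t> next{0};                    // shared "pool" of remaining work
    std::vector<int> partial(num_workers, 0);       // one private sum per worker

    auto worker = [&](int id) {
        // Dynamic assignment: each worker takes the next available number,
        // so faster workers naturally end up doing more of the work.
        while (true) {
            size_t i = next.fetch_add(1);
            if (i >= numbers.size()) break;
            partial[id] += numbers[i];
        }
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_workers; ++id) threads.emplace_back(worker, id);
    for (auto& t : threads) t.join();

    // The final combine is the communication/data-movement step; with many
    // workers it can itself be parallelized as a tree of pairwise additions.
    int total = 0;
    for (int s : partial) total += s;
    std::printf("sum = %d\n", total);
}
```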

Processor Architecture and Parallelism

  • Transistor sizes getting smaller allowed for higher clock frequency and more transistors on a chip. (42m33s)
  • A computer program is a list of instructions that a computer executes. (44m38s)
  • A processor executes instructions, which perform operations like arithmetic or math, resulting in changes to values in memory or processor registers. (47m1s)
  • A simplified model of a processor consists of three main components: control (orange), arithmetic (yellow), and execution context (blue). (48m1s)
  • A computer program is a sequence of instructions that the processor executes step-by-step, fetching values from registers, performing calculations, and updating the program's state. (49m52s)
  • Parallelism can be achieved by identifying instructions or groups of instructions that are independent of each other and can be executed concurrently without affecting the final result (see the sketch after this list). (52m37s)
  • Processor architects designed hardware that could find independent instructions in a program and reorder them to execute in parallel, without the programmer's knowledge. (54m19s)
  • Studies found that there is a limit to how much parallelism can be automatically extracted from programs, with speedup plateauing at about three or four operations executed concurrently. (56m59s)
  • The increase in clock frequency of processors stopped about 15 years ago due to power constraints, as higher clock speeds lead to increased energy consumption and heat generation. (58m37s)
  • Power consumption increases with the square of clock speed, making frequency increases an inefficient way to improve speed. (59m46s)
  • Modern smartphones utilize multicore processing with specialized cores for tasks like camera processing, neural networks, and sensing. (1h2m59s)
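
A tiny illustrative snippet (an assumed example, not taken from the lecture) of the kind of independence that instruction-level-parallel hardware looks for: the first two statements have no dependence on each other, so the processor may execute them concurrently, while the third must wait for both results.

```cpp
// Which operations are independent and could run in the same cycle?
int sum_of_squares(int x, int y) {
    int a = x * x;   // independent of b: can execute in parallel with it
    int b = y * y;   // independent of a
    int c = a + b;   // depends on both a and b: must wait for their results
    return c;
}
```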

Memory and Caches

  • Memory, abstractly, is an array of values that can be accessed by their addresses, regardless of the underlying implementation (e.g., RAM, DDR4). (1h5m3s)
  • A processor can store values in its registers and in memory. Moving a value from memory to a register involves specifying the memory address and the target register. (1h5m47s)
  • Memory can have high latency, especially DRAM, which can take hundreds of processor cycles to access data. Caches, smaller and faster storage mechanisms, mitigate this latency by storing frequently accessed data. (1h6m43s)
  • Caches leverage spatial locality: accessing data at a particular address suggests that nearby addresses will be accessed soon. They store data in cache lines, transferring multiple bytes from memory at once (see the sketch after this list). (1h10m19s)
  • Data that sits farther down the cache hierarchy (e.g., in L2 or L3 rather than L1) takes slightly longer to access. (1h11m56s)
  • A load instruction takes a relatively long time to complete if the required data is not in the cache. (1h12m3s)
  • Ron will provide further details and implications of this concept in the next session. (1h12m13s)
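
A rough C++ sketch of why cache lines reward spatial locality (the array size, the stride, and the 64-byte line size are assumptions for illustration, not figures from the lecture): the sequential traversal touches consecutive addresses, so most loads hit in a line already brought into the cache, while the strided traversal pulls in a new cache line for almost every load and therefore pays memory latency far more often, even though both loops do the same total amount of arithmetic.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t N = 1 << 24;          // ~16M ints (~64 MB), larger than the caches
    const size_t STRIDE = 16;          // 16 * 4 bytes = 64 B, a typical cache line
    std::vector<int> data(N, 1);

    auto time_sum = [&](size_t stride) {
        long long sum = 0;
        auto start = std::chrono::steady_clock::now();
        // Stride 1: consecutive loads fall on the same cache line.
        // Stride 16: nearly every load fetches a new line from memory,
        // so the array is streamed from DRAM many times over.
        for (size_t offset = 0; offset < stride; ++offset)
            for (size_t i = offset; i < N; i += stride)
                sum += data[i];
        auto stop = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        std::printf("stride %zu: sum=%lld, %lld ms\n", stride, sum, (long long)ms);
    };

    time_sum(1);        // sequential: good spatial locality
    time_sum(STRIDE);   // strided: same work, far more cache misses
}
```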
