# Stanford CS149 I Parallel Computing I 2023 I Lecture 8 - Data-Parallel Thinking

## Data Parallel Programming Concepts

- The remainder of the course will focus on writing code using operations that are known to be highly parallel, rather than focusing on low-level details like threads and dependencies. (3m30s)
- The concept of "sequences" will be used, which are ordered collections of elements that can only be accessed through specific operations, unlike arrays. (5m15s)

## Data Types for Sequences

- Different programming languages have their own data types that embody the concept of sequences, such as
`std::sequence`

in C++, lists in Scala, dataframes and NumPy arrays in Python, and tensors in PyTorch and TensorFlow. (5m23s)

## Higher-Order Functions for Parallelism

- Map is a higher-order function that takes a function
`F`

and a sequence`s`

, partitions`s`

into smaller subsequences, applies`F`

to each subsequence in parallel, and concatenates the results. (12m46s) - Fold, also known as
`foldLeft`

in Scala, is a higher-order function that takes a function`f`

(which takes two arguments of types A and B and produces a B), a starting element`B`

, and a sequence of elements of type`A`

. It iteratively applies`f`

to the elements of the sequence and the accumulator, producing a single output of type`B`

. (13m27s)

## Parallelizing Map and Fold

- The function argument of map only needs to handle a single element of the array, making it inherently parallelizable. (11m0s)
- Map can be implemented in parallel using techniques like SIMD processing or thread pools to handle arbitrarily large input sequences. (12m2s)
- Parallelizing
`fold`

requires the function`f`

to be associative. If`f`

is associative, a parallel implementation can split the input sequence, apply`fold`

to the subsequences in parallel, and then combine the results using`f`

. (17m40s)

## Data Parallel Primitives

- Map is a primitive that applies a function to a sequence of numbers. For example, multiplying each number in a sequence by 10. (19m3s)
- Fold is a primitive that takes a sequence and reduces it to a single value using a binary operator, such as summing all elements in a sequence. (19m8s)
- Scan is a primitive that produces a sequence where each element is the result of applying a binary operator (like addition) to all elements up to and including the current element in the input sequence. (20m20s)

## Parallel Summation on GPUs

- Parallel Summation on GPUs: The speaker discusses implementing a parallel sum operation on a GPU, aiming for a highly parallel solution with the same number of threads as elements. (25m48s)

## Analyzing Algorithm Complexity

- Analyzing Algorithm Complexity: A sequential scan of an array of size 'n' has a cost of O(n), while a parallel scan using a combining tree approach initially results in O(n log n) work. (27m15s)

## Optimizing Parallel Scan

- Optimizing Parallel Scan: An alternative algorithm, attributed to Guy Blelloch, maintains a logarithmic span while reducing the work complexity to O(n) by employing a two-phase approach with an "upsweep" and "downsweep" (combining and splatting tree) to efficiently compute partial sums. (28m38s)

## Practical Considerations for Parallel Scan

- A scan algorithm implemented with an infinite number of processors would not be efficient in practice because it doesn't fully utilize all processors, and data movement can be inefficient. (31m57s)
- A simpler approach for parallelizing a scan on two cores involves dividing the array in half, performing sequential scans on each half, and then parallelizing the application of the base to the second half. (33m17s)
- A CUDA or ISPC implementation of a scan for SIMD architectures would involve each thread computing its lane, performing a logarithmic number of additions based on its lane index, with some threads becoming inactive as the computation progresses. (35m11s)

## Work Efficiency vs. Utilization

- An algorithm that takes 10 cycles to complete work can be faster than one that takes 5 cycles if the computer has more execution units than the 5 cycle algorithm can utilize. (38m26s)
- When there are more threads than processing units, it can be beneficial to use a less work-efficient algorithm to ensure full utilization of processing power. (40m39s)

## Example: Parallel Scan of 1024 Elements

- A scan of 1024 elements can be completed in 11 cycles by breaking the data into 32 element chunks, performing a scan on each chunk in parallel, scanning the results of each chunk, and then updating each element with the results of the second scan. (42m36s)

## Optimizations in Data Parallel Primitives

- Sophisticated data parallel primitives are used to efficiently implement operations on machines. (43m30s)
- Nvidia likely uses these types of optimizations, particularly for larger Single Instruction Multiple Data (SIMD) widths, to improve power efficiency by a factor of 1.5 or more. (44m42s)

## Segmented Scan

- Segmented scan, a common operation in applications dealing with sequences of sequences, like graphs or particle simulations, can be implemented efficiently by adapting the work-efficient scan algorithm. (46m41s)

## Sparse Matrix Multiplication

- Compressed sparse row format stores sparse matrices by storing a sequence of nonzero values and their corresponding column indices. (51m19s)
- The format also includes an array indicating the starting index of each row's subsequence in the nonzero values and column indices arrays. (52m42s)
- To perform sparse matrix multiplication using this format, a "gather" operation can be used to create a new array containing the elements of the dense vector corresponding to the column indices of the nonzero values. (54m11s)
- A map operation then multiplies the nonzero values with the corresponding elements in the gathered array. (54m43s)
- Finally, a segmented scan operation sums the products across each row to obtain the final result. (54m57s)
- Sparse matrix multiplication can be parallelized by flattening the matrix and treating it as a data parallel computation. This approach is efficient when the number of non-zeros is significantly larger than the number of rows. (55m48s)

## Gather and Scatter Operations

- Gather and scatter are essential data movement operations in data parallel thinking. Gather creates a dense output from a sparse input using an index sequence, while scatter distributes dense data to potentially sparse locations based on indices. (57m11s)
- A scatter operation, often used in histogram computations, can be implemented using a sort and segmented scan if a dedicated scatter operation is unavailable. (1h0m18s)

## Data Structure for Particle Simulation

- The speaker describes a data structure that is a sequence of sequences, where the outer sequence represents a grid of cells and the inner sequences represent lists of particle IDs in each cell. (1h5m9s)

## Parallelizing Particle Simulation

- A naive implementation to create this data structure would involve iterating over each particle, computing the cell it belongs to, and appending the particle ID to the corresponding cell list, potentially using a lock for synchronization. (1h6m3s)
- This naive implementation is deemed to have limited parallelism due to potential synchronization bottlenecks caused by the lock. (1h6m49s)
- To reduce contention for a shared lock in a parallel computing context, one solution is to create a separate cell list for each worker thread. (1h7m37s)
- An alternative approach involves implementing a lock for each individual cell to minimize collisions between threads accessing different cells. (1h8m16s)
- A high-performance parallel sorting algorithm can be employed to group particle indices by their corresponding grid cells, effectively achieving a group-by operation. (1h12m0s)

## Creating a Uniform Grid

- To create a uniform grid in a data-parallel way, a data structure is created that stores the start and end indices of particles belonging to each cell in the grid. (1h13m41s)
- This data structure is constructed using a series of data-parallel operations: map, sort, and another map. (1h14m6s)

## Libraries for Data Parallel Programming

- Libraries like Nvidia's Thrust for GPUs and Apache Spark for distributed programming provide high-level primitives (map, sort, scan, segmented scan) based on these data-parallel concepts. (1h16m35s)