The PRAM Model

We now consider a second model of parallel computation based on synchronous parallel random access machines (PRAMs) (Wyllie 1979).
We can think of this as a model for running parallel machine code.
There are several equivalent ways to formulate PRAMs; below I fix one formulation for this CS255 class.
A PRAM will consist of `p` processors, each with its own processor id, and each of which can be viewed as supporting the RAM model of computation:
- Each processor has a fixed finite set of local registers 1, ..., m reserved to itself, which we'll call accumulators.
- The `p`-processors share a global memory consisting of `M` registers, that each processor may read from and write to.
- Each processor gets the same program, with lines numbered 0 to n-1; a program counter (PC) says which line is current.
- Each processor has a finite instruction set supporting direct and indirect addressing, simple branching, and arithmetic instructions. Below are some examples of the kinds of instructions our RAMs support:
  - Read j, k -- read register j into accumulator k
  - Read (j),k -- look up value v in register j, then read register v into accumulator k
  - ReadAcc (k1),k2 -- look up value v in accumulator k1, then read register v into accumulator k2
  - LoadProcid k -- load the current processor id into accumulator k.
  - Load x,k -- load integer x into accumulator k
  - Store j, k -- store value of accumulator k into register j
  - Store(j), k -- look up value v in register j, then store in register v value of accumulator k
  - StoreAcc (k1), k2 -- look up value v in accumulator k1, then store in register v value of accumulator k2
  - Jump j -- set program counter to j
  - JNeg j, k -- set program counter to j if accumulator k is negative
  - JPos j, k -- set program counter to j if accumulator k is positive
  - JZero j, k -- set program counter to j if accumulator k is zero
  - Add j, k -- add the value of register j to the accumulator k's value
  - Sub j,k -- subtract the value of register j from the accumulator k's value
  - Half k -- divide accumulator k's value by 2, rounding down (aka shift right)
  - Halt -- halts execution of current processor.
- Each register and accumulator can hold one integer. I.e., a number like -20, 57, 0, etc.
At the start of a computation the program counter of each processor is 0, and all accumulators are 0. The input is written as a sequence of integers in the global memory registers.
A computation ends when all processors halt. The content of the global memory is the output. We say our program accepts an input if the 0th global memory location has a positive value when the computation halts.

More on the PRAM Model

A computation in the PRAM model proceeds as a series of parallel steps:

In each such step, each processor executes the single instruction given by its program counter. This can involve:
- Choosing a global memory location to read from
- Executing the instruction on the operand fetched together with any of the operands in local registers.
- Finally, writing a value to a single global memory location.
- Updating the program counter either by adding 1 (default) or according to what the instruction executed says.
We insist on synchrony: that is, each processor finishes the parallel step i before the parallel step i+1 begins.
Conflict resolution can be handled in different ways: Exclusive Read/Exclusive Write (EREW), Concurrent Read/Exclusive Write (CREW), and Concurrent Read/Concurrent Write (CRCW).
Exclusive means at most one processor can do that action on a given memory location in a given step. Concurrent means several processors may do so in the same step.
We will mainly be interested in EREW and CREW PRAMs.
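
As a rough illustration of the three policies, the sketch below checks whether the memory accesses of one parallel step are allowed. The function name `step_is_legal` and the representation of a step as two lists of addresses are assumptions made for this example; it also does not model a read and a write hitting the same location in the same step.

```python
from collections import Counter

def step_is_legal(reads, writes, mode):
    """Check one synchronous step against an access policy.

    reads / writes: global memory addresses touched in this step,
    one entry per processor performing that kind of access.
    mode: "EREW", "CREW", or "CRCW".
    """
    if mode not in ("EREW", "CREW", "CRCW"):
        raise ValueError(f"unknown mode: {mode}")
    # An access is concurrent if some address appears more than once.
    concurrent_read = any(count > 1 for count in Counter(reads).values())
    concurrent_write = any(count > 1 for count in Counter(writes).values())
    if mode == "EREW":
        return not concurrent_read and not concurrent_write
    if mode == "CREW":
        return not concurrent_write
    # CRCW permits both; how colliding writes are resolved is a separate choice.
    return True
```

For instance, two processors reading address 0 in the same step is rejected under EREW but accepted under CREW.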

Example PRAM Program

0. LoadProcid 0
1. ReadAcc (0),1
2. JNeg 5, 1
3. Load 1, 1
4. Jump 6
5. Load 0, 1
6. StoreAcc (0), 1
7. Halt

Assume we have `M` processors, with processor ids 0 through `M-1`, and that our global memory also holds `M` integers.
The idea is this code replaces each nonnegative integer in the input with 1 and each negative integer with 0.
So if the input was (-1, 0, 10, 5, -50) the output will be (0, 1, 1, 1, 0).
Since it can be quite painful to fully spec out machine code for our algorithms, we tend to describe them at a slightly higher level, while remaining careful that anything we say a PRAM can do could in principle be written in code as above.
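
To check the example program by hand, it can help to run it in a small simulator. The following is a minimal sketch, not a full PRAM implementation: it supports only the instructions the example uses, assumes processor ids start at 0, gives each processor two accumulators, and encodes instructions as Python tuples. The names `run_pram` and `PROGRAM` are made up for this illustration.

```python
PROGRAM = [
    ("LoadProcid", 0),   # 0. acc0 := processor id
    ("ReadAcc", 0, 1),   # 1. acc1 := memory[acc0]
    ("JNeg", 5, 1),      # 2. if acc1 < 0 goto 5
    ("Load", 1, 1),      # 3. acc1 := 1
    ("Jump", 6),         # 4. goto 6
    ("Load", 0, 1),      # 5. acc1 := 0
    ("StoreAcc", 0, 1),  # 6. memory[acc0] := acc1
    ("Halt",),           # 7. stop this processor
]

def run_pram(program, memory, num_procs):
    """Run `program` synchronously on `num_procs` processors; return final memory."""
    mem = list(memory)
    acc = [[0, 0] for _ in range(num_procs)]  # accumulators start at 0
    pc = [0] * num_procs                      # program counters start at 0
    halted = [False] * num_procs
    while not all(halted):
        snapshot = list(mem)  # reads in a step see memory as of the start of the step
        for p in range(num_procs):
            if halted[p]:
                continue
            op, *args = program[pc[p]]
            next_pc = pc[p] + 1               # default: advance by one line
            if op == "LoadProcid":
                acc[p][args[0]] = p
            elif op == "ReadAcc":
                k1, k2 = args
                acc[p][k2] = snapshot[acc[p][k1]]
            elif op == "Load":
                x, k = args
                acc[p][k] = x
            elif op == "StoreAcc":
                k1, k2 = args
                mem[acc[p][k1]] = acc[p][k2]
            elif op == "Jump":
                next_pc = args[0]
            elif op == "JNeg":
                j, k = args
                if acc[p][k] < 0:
                    next_pc = j
            elif op == "Halt":
                halted[p] = True
            else:
                raise ValueError(f"unsupported instruction: {op}")
            pc[p] = next_pc
    return mem

print(run_pram(PROGRAM, [-1, 0, 10, 5, -50], 5))  # prints [0, 1, 1, 1, 0]
```

Each processor only ever touches the memory cell matching its own id, so this particular program never needs a concurrent read or write.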

Complexity Classes

We want to define complexity classes which characterize those problems which can be executed more efficiently if we have more processors.

Recall an alphabet, `Sigma`, is any fixed finite set. Let `Sigma^star` denote the set of strings over the alphabet `Sigma`. We define a language, `L`, over `Sigma` to be any `L subseteq Sigma^star`. The complement of a language `L`, `bar L`, consists of those strings `w` in `Sigma^star` but not in `L`.

Definition:

The class (uniform) NC consists of languages `L` that have a PRAM algorithm `A` such that for any `x in Sigma^star`,

- `x in L => A(x)` accepts,
- `x !in L => A(x)` rejects,
- The number of processors used by `A` on `x` is polynomial in `|x|`, where `|x|` denotes the length of the string `x`. I.e., bounded by a function `c|x|^k` for some fixed `c`, `k`.
- The number of steps used by `A` on `x` is polylogarithmic in `|x|`. I.e., bounded by a function `c(log|x|)^k` for some fixed `c`, `k`.

The class RNC (where, in addition to an input `x` in the global memory, one gets access to an input `y` of 0/1 coin tosses) is defined the same way, except we modify the first two conditions to:

- `x in L => Pr_y{A(x, y) mbox( accepts)} >= 1/2`,
- `x !in L => Pr_y{A(x, y) mbox( accepts)} = 0`.

The class ZNC is RNC`cap`co-RNC where co-RNC consists of those languages whose complement is in RNC.

For each of these classes we can define an analogous function/algorithm class. For example, `FNC` is the class of functions `f: NN -> NN` which on inputs of length `n` produce an output of length bounded by `p(n)`, where `p` is a polynomial, such that deciding the `i`th bit of the output of `f` is a language in `NC`. We sometimes abuse notation and write NC (resp. RNC, ZNC) for both NC and FNC (resp. RNC and FRNC, ZNC and FZNC).

Quiz

Which of the following statements is true?

You can have a race condition not involving a write operation.
The `n times n` matrix multiply parallel algorithm from class involved computing the products of 8 pairs of `n/2 times n/2` matrices.
Any Java class extending Thread must override the sync() method.

Sorting on a PRAM

We now work towards a ZNC algorithm for sorting which runs in `O(log n)` time doing a total of `O(n log n)` operations on all processors (Reischuk 1985). Let `P_i` denote the `i`th processor.

We begin by considering a PRAM variant of Quicksort:

1. If `n=1` stop.
2. Otherwise, pick a splitter uniformly at random from the `n` input elements.
3. Each processor determines whether its element is bigger or smaller than the splitter.
4. Let `j` denote the rank of the splitter. If `j` is not in `[n/4, (3n)/4]` the step is declared a failure and we go back to step 1. Otherwise, we move the splitter to processor `P_j`. Each element that is smaller than the splitter is moved to a distinct processor `P_i` such that `i lt j`. Each element that is larger than the splitter is moved to a distinct processor `P_k` where `k gt j`.
5. We sort recursively the data in the processors `P_1` through `P_(j-1)` and the data in `P_(j+1)` through `P_(n)`.

Notice this algorithm uses randomness, but it has zero error: we can detect when we are in a bad case and rerun. It is in this sense that it is a ZNC algorithm for sorting.
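
The control flow of this Quicksort variant can be sketched sequentially as follows. This is only a simulation of the logic, not a PRAM program: list comprehensions stand in for the per-processor comparisons, the input elements are assumed distinct, and the name `pram_quicksort` and the optional `rng` parameter are choices made for this example.

```python
import random

def pram_quicksort(items, rng=random):
    """Sequentially simulate the PRAM Quicksort variant; assumes distinct items."""
    n = len(items)
    if n <= 1:
        return list(items)
    while True:
        splitter = rng.choice(items)
        # On a PRAM each processor does one of these comparisons in parallel.
        smaller = [x for x in items if x < splitter]
        larger = [x for x in items if x > splitter]
        j = len(smaller) + 1  # 1-based rank of the splitter
        if n / 4 <= j <= 3 * n / 4:
            break  # good splitter: both sides have at most 3n/4 elements
        # Otherwise the attempt is a failure and we retry with a fresh splitter.
    return pram_quicksort(smaller, rng) + [splitter] + pram_quicksort(larger, rng)
```

The output is always correctly sorted; randomness affects only how many retries happen, which is the zero-error behavior described above.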

Analysis

The third step above is O(1) time.
To do step 4, we assume we have a global input array `A` and global output arrays `B`, `C`, and `S`. Each processor `P_i` sets a bit `b_i`, stored in `B` in one of the global registers, to `0` if its element is greater than the splitter and to `1` otherwise.
For all `i`, let `S_i=sum_(t le i)b_t`. Using a `log`-depth tree of processors, all the `S_i`'s can be computed in parallel in `O(log n)` steps into `S`. This allows us to map into the output array those elements which are not larger than the splitter. Namely, each processor in parallel checks if `b_i` is 1 (where `i` is its global id); if so, it maps the element in `A_i` to `C_(S_i)`. Using a similar setting of bits and computing sums, we can also handle all those elements which should be mapped to locations larger than `j`. From this we can do step 4 in `O(log n)` steps.
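
The bit-setting, prefix-sum, and move steps just described can be sketched as below. Note one substitution: instead of an explicit tree of processors, this sketch uses a doubling scan, which also finishes in about `log n` synchronous rounds. The function names are hypothetical, and the inner loops stand in for work that all processors would do in parallel.

```python
def prefix_sums(bits):
    """Inclusive prefix sums in about log2(n) synchronous rounds (doubling scan)."""
    s = list(bits)
    n = len(s)
    d = 1
    while d < n:
        prev = list(s)            # every processor reads the previous round's values
        for i in range(d, n):     # conceptually, all i are handled in parallel
            s[i] = prev[i] + prev[i - d]
        d *= 2
    return s

def move_smaller(a, splitter):
    """Pack the elements of `a` that are <= splitter into the front of a new array."""
    b = [1 if x <= splitter else 0 for x in a]  # bit b_i, one per processor
    s = prefix_sums(b)                          # S_i = b_1 + ... + b_i
    c = [None] * (s[-1] if s else 0)
    for i, x in enumerate(a):                   # conceptually parallel; targets are distinct
        if b[i] == 1:
            c[s[i] - 1] = x                     # S_i is 1-based, Python lists are 0-based
    return c

print(prefix_sums([1, 0, 1, 1, 0]))      # prints [1, 1, 2, 3, 3]
print(move_smaller([5, 1, 9, 3, 7], 5))  # prints [5, 1, 3]
```

Because the `S_i` values of the selected elements are distinct, no two processors write to the same output cell.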
The expected number of times steps 1-4 must be run before succeeding is `1/(1/2) = 2`, since half the `j`'s are in the interval `[n/4, (3n)/4]`. Each successful split leaves subproblems of size at most `(3n)/4`, so the recursion has depth `O(log n)`, and each level takes `O(log n)` expected steps. So the total expected runtime will be `O(log^2 n)`.
Remark: To output the bits `b_i` above takes only constant time. We could imagine that, rather than using `n` processors to output the bits, we use `n/(log n)` processors, each outputting `log n` many bits in `O(log n)` time. Similarly, we can adjust the computation of the `S_i` to only need `n/(log n)` processors. We thus use fewer processors at the expense of making the constants in the `O` larger.

Using More Splitters

Suppose we have `n` processors and `n` elements. Suppose the first `r` processors have values in sorted order.

We can use these `r` elements as splitters to sort the remaining `n-r` elements.
The goal is to insert the `n-r` unsorted elements among the splitters in the following sense:
- Each processor should end up with a distinct input element.
- Let `s_j` denote the `j`th largest splitter and `i_(s_j)` denote the index of the processor containing `s_j` after the insertion. Then for all `k lt i_(s_j)`, processor `P_k` contains an element that is smaller than `s_j` and for all `k gt i_(s_j)`, `P_k` contains an element that is larger than `s_j`.
We hope we can do this in `O(log n)` time.

A Good Choice for `r`, a Complete Algorithm (called BoxSort)

If so, we could pick `n^(1/2)` elements at random and then using all `n` processors sort them in `O(log n)` steps.
- We can imagine having an array `R` giving the indices of the randomly selected elements.
- After sorting, the array `R` has its indices rearranged so they give the selected elements in ascending order.
- To do the sorting, we imagine, for each `i` of the `sqrt(n)` many elements of `R`, using `sqrt(n)-1` processors to compare it with the other `sqrt(n)-1` elements, then using `O(log n)` time to sum the values of these comparisons to determine the number of elements smaller than element `i`.
Then using these sorted elements insert the remaining elements among them in `O(log n)` steps:
- Notice, using binary search, a processor `i` can determine between which two splitters pointed to by `R`, `A_i` should go in `O(log n)` time.
- So it can output a bit `b_i` saying whether or not location `i` is less than the location of its splitter, as given by the index in `R`.
- We can then compute sums `S_i` of these bits in parallel as with the QuickSort case and move in the same fashion.
Treat the remaining elements that are inserted between splitters as subproblems; recur on each subproblem whose size exceeds `log n`, and otherwise use LogSort:
- Compare each element in parallel with its neighbor first to its left, swap if necessary; then to its right, swap if necessary, do this O(log n) times.
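
Putting the pieces together, the overall shape of BoxSort can be sketched sequentially. This is a simulation under several assumptions, not the PRAM algorithm itself: `bisect_left` stands in for each processor's binary search, `rank_sort` mimics sorting the splitters by counting comparisons, `log_sort` is written as odd-even transposition sort, and the threshold is taken as roughly `log_2` of the original input size. All function names are made up for this example.

```python
import math
import random
from bisect import bisect_left

def log_sort(box):
    """Odd-even transposition sort: len(box) rounds of neighbor compare-and-swap."""
    a = list(box)
    for rnd in range(len(a)):
        for i in range(rnd % 2, len(a) - 1, 2):  # disjoint pairs, parallel on a PRAM
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

def rank_sort(values):
    """Sort by counting, for each element, how many others precede it."""
    out = [None] * len(values)
    for i, v in enumerate(values):
        # Ties are broken by position so that every rank is used exactly once.
        rank = sum(1 for j, w in enumerate(values) if (w, j) < (v, i))
        out[rank] = v
    return out

def box_sort(items, rng=random, threshold=None):
    """Sequential sketch of BoxSort using about sqrt(n) random splitters per box."""
    n = len(items)
    if threshold is None:
        threshold = max(2, math.ceil(math.log2(max(n, 2))))
    if n <= threshold:
        return log_sort(items)
    r = math.isqrt(n)
    chosen = set(rng.sample(range(n), r))        # indices of the random splitters
    splitters = rank_sort([items[i] for i in sorted(chosen)])
    boxes = [[] for _ in range(r + 1)]
    for i, x in enumerate(items):
        if i not in chosen:
            boxes[bisect_left(splitters, x)].append(x)  # binary search among splitters
    result = []
    for k, box in enumerate(boxes):
        result.extend(box_sort(box, rng, threshold))    # recur on each box
        if k < r:
            result.append(splitters[k])
    return result
```

Each recursive call works on strictly fewer elements because the splitters are removed from the boxes, so the recursion always terminates.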

Intuition on Splitting

Doing splitting should take `O(log mbox((the size of box we need to split)))`.
In a perfect world, at each level the expected size of the box goes down by a square root, so we get the sum
`O(log n + log n^(1/2) + log n^(1/4) + ...) = O(log n + 1/2 log n + 1/4 log n + ...) = O((log n) cdot (1 + 1/2 + 1/4 + ...)) = O(log n)`
We will argue that even in an imperfect world, with high probability the sum of the logs of the sizes of the boxes along any path is `O(log n)`, so the runtime will be `O(log n)`.
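
The perfect-world sum can be checked numerically. The sketch below adds up `log_2` of the box size along one path when every level shrinks the box to its square root, stopping once the box has size at most 2. The function name `ideal_split_cost` and the stopping point are assumptions made for this illustration.

```python
import math

def ideal_split_cost(n):
    """Sum of log2(box size) along one path when each level takes a square root."""
    total = 0.0
    size = float(n)
    while size > 2:
        total += math.log2(size)
        size = math.sqrt(size)  # the box shrinks to its square root
    return total

for n in (2**10, 2**20, 2**40):
    # The geometric series 1 + 1/2 + 1/4 + ... is below 2, so the cost stays below 2 log2(n).
    print(n, ideal_split_cost(n), 2 * math.log2(n))
```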

Analysis of Splitting

To see that the intuition of the last slide is correct, partition the interval `[1,n]` into sub-intervals `I_0, I_1, ...`. We will then bound the probability that a box whose size is in `I_k` has a child whose size is also in `I_k`.
Fix `gamma` and `d` such that `1/2 lt gamma lt 1` and `1 lt d lt 1/gamma`. For positive integers `k`, let `tau_k=d^k`, `rho_k= n^(gamma^k)`. Define `I_k=[rho_(k+1), rho_k]`.
Note that `n = (log n)^((log_(log n) 2) log n)`. So if `gamma^k lt 1/(log n log_(log n) 2)`, then `rho_k = n^(gamma^k) lt log n`. This will happen for some `k lt c log log n`, where `c` is a constant.
So we will only be interested in `O(log log n)` many intervals `I_k`.
For a box `B` in the tree, we let `alpha(B)=k` if `|B|` is in `I_k`.
In terms of our notation, the time to split box `B` is `O(log rho_(alpha(B)))`.
For a root-leaf path, `P = (B_1, ..., B_t)`, the runtime is given by `sum_(j=1)^t log rho_(alpha(B_j))`.
The total runtime of the algorithm will be `O` of this sum plus `log n` (to sort the leaves).
Define the event `E_P` to be that the sequence `alpha(B_1), ..., alpha(B_t)` does not contain the value `k` more than `tau_k` times for `1 le k le c log log n`.
If `E_P` holds then the number of PRAM steps on path `P` will be:
`O(log n + sum_(k=1)^infty tau_k gamma^k log n)`

The End of Sorting

Since `tau_k=d^k` and `d cdot gamma lt 1`, this sums to `O(log n)`.
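
For completeness, here is the calculation behind that claim, written out as a sketch. It assumes `E_P` holds, so each value `k` occurs at most `tau_k` times on the path, and it uses `log rho_k = gamma^k log n`, which follows from the definition of `rho_k`.

```latex
\begin{align*}
\sum_{k \ge 1} \tau_k \log \rho_k
  &= \sum_{k \ge 1} d^k \, \gamma^k \log n
     && \text{since } \tau_k = d^k,\ \rho_k = n^{\gamma^k} \\
  &= (\log n) \sum_{k \ge 1} (d\gamma)^k \\
  &= \frac{d\gamma}{1 - d\gamma} \, \log n
     && \text{geometric series, } d\gamma < 1 \\
  &= O(\log n).
\end{align*}
```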
So it suffices to show `E_P` happens with high probability.
Lemma. There is a constant `b gt 1` such that `E_P` holds with probability at least `1 - exp(-log^b n)`.
Proof. The proof of this is given as a sequence of exercises in the book, which we omit. It makes use of Exercise 12.6, which follows from Chernoff bounds. (Chernoff Bounds: If `X` is the sum of `n` independent random variables, each of which takes the value 0 or 1, the latter with probability `p`, then for `0 le theta le 1`, `Pr{X ge (1+theta)pn} le e^(-(theta^2 p n)/3)`.)
From this we can conclude:
Theorem. There is a constant `b gt 1` such that with probability at least `1-exp(-log^b n)`, the algorithm BoxSort terminates in `O(log n)` steps.
So although in a bad case it might take longer, with high probability this is an `O(log n)` time algorithm.
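
The Chernoff bound quoted in the proof above can be sanity-checked numerically for small cases by comparing it with the exact binomial tail. This is only an illustration for a few hand-picked parameters, not part of the proof; the function names are hypothetical.

```python
import math

def binomial_upper_tail(n, p, threshold):
    """Exact Pr{X >= threshold} for X the sum of n independent 0/1 variables."""
    start = max(0, math.ceil(threshold))
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(start, n + 1)
    )

def chernoff_bound(n, p, theta):
    """The bound exp(-(theta^2 p n) / 3) on Pr{X >= (1 + theta) p n}."""
    return math.exp(-(theta**2) * p * n / 3)

for n, p, theta in [(100, 0.5, 0.2), (200, 0.1, 0.5), (50, 0.3, 1.0)]:
    exact = binomial_upper_tail(n, p, (1 + theta) * p * n)
    print(n, p, theta, exact, chernoff_bound(n, p, theta))
```

In each of these cases the exact tail probability comes out well below the bound, as expected since the Chernoff bound is not tight.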
