PRAMs and PRAM Sorting




CS255

Chris Pollett

Feb 25, 2019

Outline

Introduction

The PRAM Model

More on the PRAM Model

A computation in the PRAM model proceeds as a series of parallel steps:

Example PRAM Program

0. LoadProcid 0
1. ReadAcc (0),1
2. JNeg 5, 1
3. Load 1, 1
4. Jump 6
5. Load 0, 1
6. StoreAcc (0), 1
7. Halt

Complexity Classes

We want to define complexity classes which characterize those problems which can be executed more efficiently if we have more processors.

Recall an alphabet, `Sigma`, is any fixed finite set. Let `Sigma^star` denote the set of strings over the alphabet `Sigma`. We define a language, `L`, over `Sigma` to be any `L subseteq Sigma^star`. The complement of a language `L`, written `bar L`, consists of those strings `w` in `Sigma^star` that are not in `L`.

Definition:

The class (uniform) NC consists of languages `L` that have a PRAM algorithm `A` such that for any `x in Sigma^star`,

The class RNC (where, in addition to an input `x` in the global memory, one gets access to an input `y` of 0, 1 coin tosses) is defined the same way except we modify the first two conditions to:

The class ZNC is RNC`cap`co-RNC where co-RNC consists of those languages whose complement is in RNC.

For each of these classes we can define an analogous function/algorithm class. I.e., `FNC` is the class of functions `f: NN -> NN` which on inputs of length `n` produce an output of length bounded by `p(n)`, where `p` is a polynomial, such that deciding the `i`th bit of the output of `f` is a language in `NC`. We sometimes abuse notation and write NC (resp. RNC, ZNC) for both NC and FNC (resp. RNC and FRNC, ZNC and FZNC).

Quiz

Which of the following statements is true?

  1. You can have a race condition not involving a write operation.
  2. The `n times n` matrix multiply parallel algorithm from class involved computing the products of 8 pairs of `n/2 times n/2` matrices.
  3. Any Java class extending Thread must override the sync() method.

Sorting on a PRAM

We now work towards a ZNC algorithm for sorting which runs in `O(log n)` time, doing a total of `O(n log n)` operations across all processors (Reischuk 1985). Let `P_i` denote the `i`th processor.

We begin by considering a PRAM variant of Quicksort:

  1. If `n=1` stop.
  2. Otherwise, pick a splitter uniformly at random from the `n` input elements.
  3. Each processor determines whether its element is bigger or smaller than the splitter.
  4. Let `j` denote the rank of the splitter. If `j` is not in `[n/4, (3n)/4]`, the step is declared a failure and we go back to step 1. Otherwise, we move the splitter to processor `P_j`. Each element that is smaller than the splitter is moved to a distinct processor `P_i` such that `i lt j`. Each element that is larger than the splitter is moved to a distinct processor `P_k` where `k gt j`.
  5. We sort recursively the data in the processors `P_1` through `P_(j-1)` and the data in `P_(j+1)` through `P_(n)`.

Notice this algorithm uses randomness, but it has zero error: we can detect when we are in a bad case and rerun. It is in this sense that it is a ZNC algorithm for sorting.
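The five steps above can be sketched as a sequential Python simulation. This is only an illustration under stated assumptions, not the lecture's own code: the function name `pram_quicksort` is hypothetical, the keys are assumed to be distinct, each loop iteration stands in for one processor, and the running counters `lo` and `hi` stand in for whatever parallel mechanism assigns distinct destination processors.

```python
import random

def pram_quicksort(items, rng=None):
    """Sequentially simulate the PRAM Quicksort variant (assumes distinct keys)."""
    rng = rng or random.Random()
    n = len(items)
    if len(set(items)) != n:
        raise ValueError("this sketch assumes distinct keys")
    if n <= 1:                                   # step 1: nothing left to sort
        return list(items)
    while True:
        splitter = rng.choice(items)             # step 2: uniform random splitter
        # Step 3: one comparison per simulated processor.
        smaller_bits = [1 if x < splitter else 0 for x in items]
        j = sum(smaller_bits) + 1                # 1-based rank of the splitter
        # Step 4: a rank outside [n/4, 3n/4] is a detected failure, so retry.
        if n / 4 <= j <= 3 * n / 4:
            break
    out = [None] * n
    out[j - 1] = splitter                        # splitter goes to P_j
    lo, hi = 0, j                                # next free slots left/right of P_j
    for x in items:
        if x < splitter:
            out[lo] = x
            lo += 1
        elif x > splitter:
            out[hi] = x
            hi += 1
    # Step 5: recurse on P_1..P_(j-1) and P_(j+1)..P_n.
    left = pram_quicksort(out[:j - 1], rng)
    right = pram_quicksort(out[j:], rng)
    return left + [splitter] + right
```

Because a bad splitter is always detected and redrawn, the returned list is always sorted; the randomness affects only how many retries occur.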

Analysis

Using More Splitters

Suppose we have `n` processors and `n` elements. Suppose the first `r` processors have values in sorted order.

A Good Choice for `r`, a Complete Algorithm (called BoxSort)

  1. If so, we could pick `n^(1/2)` elements at random and then using all `n` processors sort them in `O(log n)` steps.
    • We can imagine having an array `R` giving the indices of the randomly selected elements.
    • After sorting, the array `R` has its indices rearranged so they give the selected elements in ascending order.
    • To do the sorting, we imagine, for each of the `sqrt(n)` elements `i` of `R`, using `sqrt(n)-1` processors to compare it with the other `sqrt(n)-1` elements, then using `O(log n)` time to sum the values of these comparisons to determine the number of elements smaller than `i`.
  2. Then using these sorted elements insert the remaining elements among them in `O(log n)` steps:
    • Notice that, using binary search, processor `i` can determine in `O(log n)` time between which two splitters pointed to by `R` its element `A_i` should go.
    • So it can output a bit `b_i` saying whether location `i` is less than its splitter's location as pointed to by the index in `R` or not.
    • We can then compute sums `S_i` of these bits in parallel as with the QuickSort case and move in the same fashion.
  3. Treat the remaining elements that are inserted between splitters as subproblems; recur on each subproblem whose size exceeds `log n`, and otherwise use LogSort:
    • Compare each element in parallel with its neighbor, first to its left, swapping if necessary, then to its right, swapping if necessary; do this `O(log n)` times.
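The three BoxSort steps can be simulated sequentially as in the rough Python sketch below. Several things here are assumptions rather than part of the lecture: the names `rank_sort`, `log_sort`, and `box_sort` are hypothetical, keys are assumed distinct, the cutoff is taken to be the ceiling of `log_2` of the top-level input size, and LogSort is read as alternating compare-and-swap phases over neighboring pairs (an odd-even transposition pattern). Each loop stands in for work that the PRAM would do in parallel.

```python
import bisect
import math
import random

def rank_sort(keys):
    """Place each distinct key at the position given by how many keys are smaller."""
    out = [None] * len(keys)
    for k in keys:
        rank = sum(1 for other in keys if other < k)   # summed comparison bits
        out[rank] = k
    return out

def log_sort(block, rounds):
    """Alternate neighbor compare-and-swap phases for the given number of rounds."""
    a = list(block)
    for _ in range(rounds):
        for start in (0, 1):          # pairs (0,1),(2,3),... then (1,2),(3,4),...
            for i in range(start, len(a) - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
    return a

def box_sort(items, rng=None, threshold=None):
    """Sequential simulation of BoxSort on a list of distinct keys."""
    rng = rng or random.Random()
    n = len(items)
    if threshold is None:             # assumed cutoff: about log2 of the input size
        threshold = max(2, math.ceil(math.log2(max(n, 2))))
    if n <= threshold:
        return log_sort(items, rounds=n)   # n rounds always suffice for n keys
    # Step 1: pick about sqrt(n) random splitters and sort them by rank counting.
    r = math.isqrt(n)
    splitters = rank_sort(rng.sample(items, r))
    # Step 2: binary search puts every other element into the box between
    # two consecutive splitters.
    chosen = set(splitters)
    boxes = [[] for _ in range(r + 1)]
    for x in items:
        if x not in chosen:
            boxes[bisect.bisect_left(splitters, x)].append(x)
    # Step 3: sort each box recursively and interleave the splitters.
    out = []
    for b, box in enumerate(boxes):
        out.extend(box_sort(box, rng, threshold))
        if b < r:
            out.append(splitters[b])
    return out
```

The simulation uses as many rounds as the block has keys in `log_sort`, which is at most the cutoff, so it stays within the `O(log n)` rounds described above while guaranteeing a sorted block.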

Intuition on Splitting

Analysis of Splitting

The End of Sorting