More Multithreaded Algorithms




CS255

Chris Pollett

Feb 13, 2019

Outline

Introduction

Scheduling

Brent's Theorem (1974)

Theorem. On an ideal parallel computer with `P` processors, a greedy scheduler executes a multithreaded computation with work `T_1` and span `T_(infty)` in time
`T_P le T_1/P + T_(infty)`.

Proof of Brent's Theorem

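We break the computation into complete and incomplete steps. A step is complete if at least `P` strands are ready to execute, so a greedy scheduler keeps all `P` processors busy during it. Otherwise, fewer than `P` strands are ready, the step is incomplete, and the greedy scheduler executes every ready strand.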

Consider the complete steps of the computation. In each complete step, the `P` processors together perform `P` units of work. Suppose, for contradiction, that the number of complete steps were greater than `lfloor T_1/P rfloor`. Then the total work of the complete steps would be at least `P cdot(lfloor T_1/P rfloor + 1) = P lfloor T_1/P rfloor + P`
`= T_1 - (T_1 mod P) + P`
`> T_1`.
In the above, we are defining `(T_1 mod P) := T_1 - P(lfloor T_1/P rfloor)`, the remainder when `T_1` is divided by `P`, so `0 le (T_1 mod P) lt P`, which gives the final strict inequality. This is a contradiction, because it would imply that the `P` processors perform more work than the computation requires. Hence, there are at most `lfloor T_1/P rfloor` complete steps.

Now consider an incomplete step. Let `G` be the DAG representing the entire computation. By replacing each strand that takes longer than unit time with a chain of unit-time strands, we can assume each strand takes unit time. Let `G'` be the subgraph of `G` that has yet to be executed at the start of an incomplete step, and let `G''` be the subgraph that remains to be executed after the incomplete step. A longest path in a DAG must necessarily start at a vertex with in-degree 0. Since an incomplete step of a greedy scheduler executes all strands with in-degree 0 in `G'`, the length of a longest path in `G''` must be 1 less than the length of a longest path in `G'`. So an incomplete step decreases the span of the unexecuted DAG by 1. Hence, the number of incomplete steps is at most `T_(infty)`.

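Since each step is either complete or incomplete, the total number of steps satisfies `T_P le lfloor T_1/P rfloor + T_(infty) le T_1/P + T_(infty)`, and the theorem follows.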
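The counting argument in the proof can be checked on a small example. The following is a minimal Python sketch, not part of the original slides: the names `greedy_schedule`, `span`, and the fork-join DAG `dag` are hypothetical, and every strand is assumed to take unit time, as in the proof.

```python
from collections import deque

def greedy_schedule(dag, P):
    """Simulate a greedy scheduler on a DAG of unit-time strands.

    dag maps each strand to the list of strands that depend on it.
    Returns (T_P, number of complete steps, number of incomplete steps).
    """
    indegree = {v: 0 for v in dag}
    for v in dag:
        for w in dag[v]:
            indegree[w] += 1
    ready = deque(v for v in dag if indegree[v] == 0)
    complete = incomplete = 0
    while ready:
        # Run up to P ready strands; strands that become ready during
        # this step must wait for the next step.
        batch = [ready.popleft() for _ in range(min(P, len(ready)))]
        if len(batch) == P:
            complete += 1      # all P processors were busy
        else:
            incomplete += 1    # every ready strand was executed
        for v in batch:
            for w in dag[v]:
                indegree[w] -= 1
                if indegree[w] == 0:
                    ready.append(w)
    return complete + incomplete, complete, incomplete

def span(dag):
    """Number of strands on a longest path in the DAG (T_infty)."""
    memo = {}
    def longest_from(v):
        if v not in memo:
            memo[v] = 1 + max((longest_from(w) for w in dag[v]), default=0)
        return memo[v]
    return max(longest_from(v) for v in dag)

# Hypothetical fork-join computation: a spawns b1..b5, which all join at z.
dag = {"a": ["b1", "b2", "b3", "b4", "b5"], "z": []}
dag.update({f"b{i}": ["z"] for i in range(1, 6)})

T1, Tinf, P = len(dag), span(dag), 2
TP, complete, incomplete = greedy_schedule(dag, P)
print(TP, complete, incomplete)  # prints: 5 2 3
```

Here `T_1 = 7` and `T_(infty) = 3`, so with `P = 2` the simulation uses 2 complete steps (at most `lfloor 7/2 rfloor = 3`) and 3 incomplete steps (at most `T_(infty) = 3`), for a total of 5 steps, which is within the bound `7/2 + 3 = 6.5`.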

In-Class Exercise

Optimality of Greedy Scheduling

Brent's Theorem allows us to show that greedy scheduling is close to optimal in the following sense.

Corollary. The running time `T_P` of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with `P` processors is within a factor of 2 of optimal.

Proof. Let `T_P^(star)` be the running time produced by an optimal scheduler on `P` processors. The work and span laws give `T_P^(star) ge max(T_1/P, T_(infty))`. Brent's Theorem, on the other hand, implies:
`T_P le T_1/P + T_(infty)`
`le 2 cdot max(T_1/P, T_(infty))`
`le 2T_P^(star)`. QED.
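To see the factor of 2 numerically, here is a small Python sketch. The helper name `greedy_bounds` and the values `T_1 = 7`, `T_(infty) = 3` are illustrative assumptions, not taken from the slides.

```python
def greedy_bounds(T1, Tinf, P):
    """Return (lower, upper) bounds on the greedy running time T_P."""
    lower = max(T1 / P, Tinf)  # work and span laws; also a lower bound on the optimal time
    upper = T1 / P + Tinf      # Brent's Theorem
    return lower, upper

for P in (1, 2, 4, 8):
    lower, upper = greedy_bounds(T1=7, Tinf=3, P=P)
    # The ratio upper / lower never exceeds 2.
    print(P, lower, upper, upper / lower)
```

Since the optimal time is at least `lower` and the greedy time is at most `upper`, the printed ratio bounds how far a greedy schedule can be from optimal for each `P`.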

Perfect Linear Speedup versus Slackness

Our next result shows that as a computation gets more slack it gets closer to having perfect linear speedup.

Corollary. Let `T_P` be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with `P` processors. Then, if `P lt lt T_1 / T_(infty)`, we have `T_P ~~ T_1/P`. That is, the speedup is approximately `P`.

Proof. Suppose `P lt lt T_1 / T_(infty)`. Then `T_(infty) lt lt T_1/P`. By Brent's Theorem,
`T_P le T_1/P + T_(infty) ~~ T_1/P`. As the work law says `T_P ge T_1/P`, we conclude `T_P ~~ T_1/P`. This implies the speedup is `T_1/T_P ~~ P`. QED.

Example. Suppose the slackness, `T_1/(P T_(infty))`, is greater than 10. Then the span term, `T_(infty)`, in Brent's Theorem is less than 10% of the work-per-processor term, `T_1/P`. So if a computation runs on only 10 or 100 processors, it doesn't make sense to value a parallelism, `T_1/T_(infty)`, of 1000000 over a parallelism of 10000, even with the factor of 100 difference.
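The example can be made concrete with a short Python sketch. The function names and the choice `T_1 = 10^8` are assumptions made for illustration. The speedup computed is only the lower bound guaranteed by Brent's Theorem, not a measured speedup.

```python
def slackness(T1, Tinf, P):
    """Slackness T_1 / (P * T_infty)."""
    return T1 / (P * Tinf)

def guaranteed_speedup(T1, Tinf, P):
    """Lower bound on the speedup T_1 / T_P implied by Brent's Theorem."""
    return T1 / (T1 / P + Tinf)

T1 = 10**8                  # assumed total work
for Tinf in (10**2, 10**4): # parallelism T1/Tinf of 1000000 and 10000
    for P in (10, 100):
        print(T1 // Tinf, P,
              slackness(T1, Tinf, P),
              guaranteed_speedup(T1, Tinf, P))
```

On 100 processors the guaranteed speedup is above 99.9 for parallelism 1000000 and still above 99 for parallelism 10000, so the extra factor of 100 in parallelism buys less than a 1% improvement.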

Analyzing Multithreaded Algorithms

How to analyze computation DAGs

Parallel Loops

Simulating Parallel For using Spawn and Sync

Analyzing MAT-VEC(A, x)