Map Reduce and PRAMs




CS255

Chris Pollett

Mar 9, 2015

Outline

Introduction

Mappers and Reducers

Recall a multiset is an unordered collection of objects where repeats are allowed.

Definition. A mapper is a (possibly randomized) function that takes as input one ordered `langle key; value rangle` pair of binary strings. As output the mapper produces a finite multiset of new `langle key; value rangle` pairs.

Definition. A reducer is a (possibly randomized) function that takes as input a binary string `k` which is the key, and a sequence of values `v_1, v_2, ...` which are also binary strings. As output, the reducer produces a multiset of pairs of binary strings `langle k; v_(k,1)rangle , langle k; v_(k,2)rangle, langle k; v_(k,3)rangle, ...` The key in the output tuples is identical to the key in the input tuple.

So we allow mappers to manipulate keys arbitrarily, but reducers cannot change the keys at all.
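
To make the two definitions concrete, here is a minimal sketch of one mapper, one reducer, and a single map, shuffle, reduce round. It is an illustration, not part of the formal model: Python strings stand in for binary strings, and the names `word_count_mapper`, `sum_reducer`, and `run_round` are hypothetical. Note that the mapper is free to emit new keys, while the reducer only emits pairs under the key it was given.

```python
from collections import defaultdict

def word_count_mapper(key, value):
    """Mapper: takes one (key, value) pair and returns a multiset (list) of new pairs."""
    # Assumed example: key is a document id, value is the document text.
    # The mapper discards the old key and emits one pair per word.
    return [(word, "1") for word in value.split()]

def sum_reducer(key, values):
    """Reducer: takes a key and its sequence of values; output pairs reuse the same key."""
    total = sum(int(v) for v in values)
    return [(key, str(total))]

def run_round(pairs, mapper, reducer):
    """Apply the mapper to every pair, group the results by key, then reduce each group."""
    grouped = defaultdict(list)
    for k, v in pairs:
        for k2, v2 in mapper(k, v):
            grouped[k2].append(v2)      # the shuffle step: collect values by key
    output = []
    for k2, values in grouped.items():
        output.extend(reducer(k2, values))
    return output

result = run_round([("doc1", "a b a"), ("doc2", "b c")], word_count_mapper, sum_reducer)
```

Here `result` holds one pair per distinct word, with the count encoded as a string value.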

A Map Reduce Program

Quiz

Which of the following is true?

  1. Our algorithm for Byzantine agreement had expected runtime that depended on the number of servers.
  2. Our Asynchronous-CPP algorithm made use of timestamps.
  3. In map reduce, a combiner is another name for a process that executes a shuffle step.

Assumptions of the Model

The Map Reduce Class.

Given a program input, a sequence of pairs `(k[j], v[j])`, for `j=1,2,3,...` where `k[j]` and `v[j]` are binary strings, we define the length of this input to be `n = sum_j(|k[j]| + |v[j]|)`, where `|a|` denotes the length of the binary string `a`.
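
As a small worked example of this definition, the sketch below computes `n` for a list of pairs. The function name `input_length` and the sample pairs are assumptions made for illustration; keys and values are written as strings of `0` and `1`.

```python
def input_length(pairs):
    """Return n = sum over j of (|k[j]| + |v[j]|) for a list of (key, value) strings."""
    return sum(len(k) + len(v) for k, v in pairs)

# Three hypothetical input pairs of binary strings.
pairs = [("0", "1101"), ("10", "001"), ("11", "")]
n = input_length(pairs)  # (1 + 4) + (2 + 3) + (2 + 0)
```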

Definition. Fix an `epsilon > 0`. An algorithm in `MRC^k` consists of a sequence `langle m[1], r[1], m[2], r[2], ..., m[R], r[R] rangle` of operations which outputs the correct answer with probability at least `3/4` where:

We define `MRC = cup_k MRC^k` and we define `DMRC` in an analogous fashion where we require our machines and the above operations to be deterministic.

One key thing to note is that we allow the mappers and reducers to run in time polynomial in `n`, not polynomial in the length of the input they receive.

DMRC is in P

Theorem. Languages in DMRC can be decided by RAMs running in polynomial time and using at most `O(n^2 log n)` space.

Proof. The idea is to compute all of the map reduce steps on a single machine. Note that each mapper or reducer from the definition runs in at most polynomial time and uses space at most sublinear in `n`. We require the space used by these machines in round `i` to be sublinear in the original input, not in the output of round `i-1`. In a given round the total output is sub-quadratically many key value pairs. Let `p(n)` be a polynomial bounding the running time of any mapper or reducer. We can run each mapper on a single machine in a serial fashion, collect their total outputs, and then run each reducer serially on these outputs to generate the input for the next round. Simulating a single round takes time `O(n^2 cdot p(n))`, so simulating all `O(log^k n)` rounds takes time `O(log^k n cdot n^2 cdot p(n))`. We only need to keep the previous round's output in memory at any given time, so we get the space bound.
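
The serial simulation in the proof can be sketched as follows. This is a rough illustration under simplifying assumptions: Python strings stand in for binary strings, the space and time bounds are not enforced, and `simulate` together with the two example rounds are hypothetical names chosen here. The point is only that each round is computed one mapper and one reducer at a time, and that only the previous round's output is retained.

```python
from collections import defaultdict

def simulate(pairs, rounds):
    """Serially simulate a sequence of (mapper, reducer) rounds on one machine."""
    current = list(pairs)
    for mapper, reducer in rounds:
        grouped = defaultdict(list)
        for k, v in current:                 # run the mappers one after another
            for k2, v2 in mapper(k, v):
                grouped[k2].append(v2)
        next_pairs = []
        for k2, values in grouped.items():   # run the reducers one after another
            next_pairs.extend(reducer(k2, values))
        current = next_pairs                 # keep only this round's output
    return current

# Round 1 (assumed example): key each number by its parity, then sum per key.
def parity_mapper(key, value):
    return [(str(int(value) % 2), value)]

def sum_reducer(key, values):
    return [(key, str(sum(int(v) for v in values)))]

# Round 2 (assumed example): send everything to one key, then take the maximum.
def single_key_mapper(key, value):
    return [("all", value)]

def max_reducer(key, values):
    return [(key, str(max(int(v) for v in values)))]

result = simulate(
    [("a", "3"), ("b", "4"), ("c", "5"), ("d", "6")],
    [(parity_mapper, sum_reducer), (single_key_mapper, max_reducer)],
)
```

In this run the first round produces the sums of the odd and the even inputs, and the second round reduces those two sums to their maximum under the single key `all`.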

Connections between NC and DMRC

The paper shows the following result using a padding argument on a version of the CIRCUIT VALUE PROBLEM. We skip the proof but state the result:

Theorem. If `P ne NC` then `DMRC` is not contained in `NC`.