Map Reduce and PRAMs




CS255

Chris Pollett

Mar 7, 2018

Outline

Introduction

Mappers and Reducers

Recall a multiset is an unordered collection of objects where repeats are allowed.

Definition. A mapper is a (possibly randomized) function that takes as input one ordered `langle key; value rangle` pair of binary strings. As output the mapper produces a finite multiset of new `langle key; value rangle` pairs.

Definition. A reducer is a (possibly randomized) function that takes as input a binary string `k`, which is the key, and a sequence of values `v_1, v_2, ...`, which are also binary strings. As output, the reducer produces a multiset of pairs of binary strings `langle k; v_(k,1)rangle , langle k; v_(k,2)rangle, langle k; v_(k,3)rangle, ...`. The key in the output tuples is identical to the key in the input tuple.

So we allow mappers to manipulate keys arbitrarily, but reducers cannot change the keys at all.
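As a minimal sketch of the two definitions above, the following Python code implements one mapper, one reducer, and a single map-shuffle-reduce round. It makes several assumptions for readability: keys and values are Python strings rather than binary strings, and the names `word_mapper`, `count_reducer`, and `run_round` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # assumption: str stands in for a binary string


def word_mapper(key: str, value: str) -> List[Pair]:
    """Mapper: takes one <key; value> pair, returns a multiset of new pairs.
    The output keys (words) are unrelated to the input key."""
    return [(word, "1") for word in value.split()]


def count_reducer(key: str, values: List[str]) -> List[Pair]:
    """Reducer: takes a key and its values, returns pairs with the SAME key."""
    return [(key, str(sum(int(v) for v in values)))]


def run_round(pairs: List[Pair],
              mapper: Callable[[str, str], List[Pair]],
              reducer: Callable[[str, List[str]], List[Pair]]) -> List[Pair]:
    """Run one round: map every pair, group the outputs by key, reduce each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for k, v in pairs:
        for out_k, out_v in mapper(k, v):
            groups[out_k].append(out_v)
    result: List[Pair] = []
    for k, vs in groups.items():
        for out_k, out_v in reducer(k, vs):
            # Enforce the rule that reducers cannot change the key.
            assert out_k == k, "a reducer must not change the key"
            result.append((out_k, out_v))
    return result


docs = [("d1", "a b a"), ("d2", "b")]
counts = run_round(docs, word_mapper, count_reducer)
```

Here the mapper discards the document key and emits words as the new keys, while the reducer only changes the value attached to each word.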

A Map Reduce Program

In-Class Exercise

Suppose we have a web crawler, which as part of the crawling process, reads a page whose url is `i`, and generates a list `(i, j_1),...(i, j_t)` where we have one pair for each url `j_s` that `i` links to. The crawler runs on many pages in the same fashion, outputting all the tuples of this process to disk.

Design a map reduce job that takes all these tuples and outputs all pairs `(j, t_1), ... (j, t_m)` where a pair `(j, t_v)` indicates there is a link from `t_v` to a page which in turn links to `j`.

Post your answer to the Mar 7 In-Class Exercise Thread.

Assumptions of the Model

The Map Reduce Class.

Given a program input, a sequence of pairs `(k[j], v[j])`, for `j=1,2,3,...` where `k[j]` and `v[j]` are binary strings, we define the length of this input to be `n = sum_j(|k[j]| + |v[j]|)`, where `|a|` denotes the length of the binary string `a`.
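As a small worked example of this length measure, the sketch below computes `n` for a two-pair input. The function name `input_length` and the sample pairs are hypothetical, and binary strings are represented as Python strings of `0`s and `1`s.

```python
def input_length(pairs):
    """Return n = sum over j of (|k[j]| + |v[j]|) for a list of (key, value) strings."""
    return sum(len(k) + len(v) for k, v in pairs)


# Two pairs: (|"0"| + |"101"|) + (|"11"| + |"0"|) = 4 + 3
sample = [("0", "101"), ("11", "0")]
n = input_length(sample)
```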

Definition. Fix an `epsilon > 0`. An algorithm in `MRC^k` consists of a sequence `langle m[1], r[1], m[2], r[2], ..., m[R], r[R] rangle` of operations which outputs the correct answer with probability at least `3/4` where:

We define `MRC = cup_k MRC^k` and we define `DMRC` in an analogous fashion where we require our machines and the above operations to be deterministic.

One key thing to note is that we allow the mappers and reducers to run in time polynomial in `n`, not polynomial in the length of the input they receive.

DMRC is in P

Theorem. Languages in DMRC can be decided by RAMs running in polynomial time and using at most `O(n^2 log n)` space.

Proof. The idea is to compute all of the map reduce steps on a single machine. By definition, each mapper or reducer runs in at most polynomial time and uses space at most sublinear in `n`. Note that the space used by these machines in round `i` must be sublinear in the length of the original input, not in the output of round `i-1`. In a given round the total output is sub-quadratically many key value pairs. Let `p(n)` be a polynomial bounding the run time of any mapper or reducer. We can run each mapper on a single machine in a serial fashion, collect their total outputs, and use those to run each reducer serially to generate the input for the next round. Simulating a single round takes time `O(n^2 cdot p(n))`, so simulating all rounds takes time `O(log^k n cdot n^2 cdot p(n))`. We only need to keep the previous round's output in memory at any given time, so we get the space bound.
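The serial simulation in the proof can be sketched in Python as follows. This is only an illustration of the idea, not the construction from the paper: it does not model the RAM or enforce the space and time bounds, keys and values are Python strings, and the names `simulate`, `identity_mapper`, `to_total_mapper`, and `sum_reducer` are hypothetical.

```python
from collections import defaultdict


def simulate(pairs, rounds):
    """Run a list of (mapper, reducer) rounds one after another on one machine."""
    current = list(pairs)
    for mapper, reducer in rounds:
        groups = defaultdict(list)
        for k, v in current:              # run each mapper serially
            for k2, v2 in mapper(k, v):
                groups[k2].append(v2)     # group mapper outputs by key
        next_input = []
        for k, vs in groups.items():      # run each reducer serially
            next_input.extend(reducer(k, vs))
        current = next_input              # keep only the previous round's output
    return current


def identity_mapper(k, v):
    return [(k, v)]


def to_total_mapper(k, v):
    return [("total", v)]                 # send every value to one key


def sum_reducer(k, vs):
    return [(k, str(sum(int(v) for v in vs)))]


data = [("a", "1"), ("a", "2"), ("b", "3")]
# Round 1 sums per key; round 2 sums the per-key sums.
result = simulate(data, [(identity_mapper, sum_reducer),
                         (to_total_mapper, sum_reducer)])
```

Because `current` is overwritten at the end of each round, the sketch mirrors the observation that only the previous round's output needs to be held in memory.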

Connections between NC and DMRC

The paper shows the following result using a padding argument on a version of the CIRCUIT VALUE PROBLEM. We skip the proof but state the result:

Theorem. If `P ne NC` then `DMRC` is not contained in `NC`.