Recalling Map Reduce

So far we talked about map reduce as a model of parallel computation.
As described a map reduce job had three phases: (a) a map phase, where key/value pairs are read from the input and the map function is applied to each of them individually, (b) a shuffle phase, where the pairs produced during the map phase are sorted by their key, and all values for the same key are grouped together, (c) a reduce phase where the reduce function is applied to each key and its values.
We have machines (nodes) which act as mappers. They apply the map function to a subset of the key value pairs.
They then send the results to different reducer machines based on a hash of the key value for a key-value output by the map phase.
Before the reducer begins reducing the data from all its inputs, the data is typically stored in a distributed file system, in case the reducer fails -- that way we don't have to recompute the map phase it depended on.
Often mappers and reducers are replicated three ways to keep things operating smoothly.

Recalling Page Rank

These other conditions guaranteed that if we started with some guess `vec{r'}` for `vec(r)` and we computed powers `\vec{A}^{(n)}\vec{r'}` until
`||\vec{A}^{(n+1)}\vec{r'} - \vec{A}^{(n)}\vec{r'}|| < \epsilon`
then `\vec{A}^{(n)}\vec{r'} approx \vec{r}`.
Why is computing these powers hard?
Why would we even want to try to use Map Reduce to compute this?
Powering seems to involve some kind of iteration. So we at least need to extend our basic map reduce to handle iterating the map and reduce operations.

The Scary Numbers Behind the Google Matrix

At first blush because of the teleportation correction every single entry in the Google matrix is nonzero.
How many entries are there? Well the number of rows is the number of web pages, and the number of columns is the number of web pages.
So conservatively it is a `10^(10) times 10^(10)` matrix. i.e, it has `10^(20)` entries.
Let `n` be the rank of the matrix (in this case `10^(10)`). Multiplying the matrix against `vec{r'}` takes time `O(n^2)` using the usual way to do matrix multiplication.
So 50 iterations would take time `O(50n^2)` or order `5 times 10^(21)` operations!
Currently, a typical processor core does less than `10^(10)` ops per second. So it would take `5 times 10^(11)` seconds or about `15855` years to do on a single core.
There are fast matrix multiply techniques based on techniques originally due to Strassen (1969), but these tend to be hard to use in practice -- and it wouldn't save enough to be feasible on one machine.
The key thing that makes computing `vec(r)` feasible is that the Google matrix is essentially, a normalized adjacency matrix (which only has a small number of nonzero entries), together with corrections which can be computed in linear time in the rank.
So multiplying against `vec(r')` can be done in time proportional to the square of the nonzero entries in the adjacency matrix.
If we assume any given page links to on average `50` or so other pages, this reduces the time of a single matrix to something like `O(50n)` and the total time to something like `O(2500n)` -- still huge if `n = 10^10`, but doable if we can split the job among several machines.

High-Level Parallel Algorithm for Page Rank

Let epsilon be the constant we used to decide if we stop
Compute initial list of node objects, each with a page_rank field and an adjacency list. This whole list
   we'll call current_r and slightly abuse notation to view it as a vector to which
   our matrices are applied
do {
    Store in distributed file system (DFS) pairs (nid, node) as (old_nid, old_node)
        where node is a node object (containing info about a web page)
    Do map reduce job to compute A*current_r 
        where A is the normalized adjacency matrix
    Store result in DFS as pairs (nid, node) where node has its page_rank field set to the
        value given by the above operation.
    Do map reduce job to compute dangling node correction to current_r
    Store result in DFS as (nid, node) where where node has its page_rank field set to the
        value given by the above operation.
    Do map reduce job to compute teleporter correction to current_r
    Store result in DFS as (nid, node) where where node has its page_rank field set to the
        value given by the above operation.
    Send all pairs (nid, node) in DFS computed above to reduce job which
        computes (nid, node) in which node has a page_rank equal to the sum of the three
        page_ranks that one would have grouping by nid.
    Store result in DFS as pairs (nid, node).
    Do map reduce job to compute len = || current_r - old_r||
} while(len > epsilon)
output nodes with their page ranks

As we can see, we iteratively apply map reduce jobs to data stored in the distributed file system
The amount of space we need to store stuff in the DFS is proportional to the nonzero entries in `A` and `vec(r)`.
Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii (2010) studied what kinds of computations can be done by iteratively applying map reduce jobs `t` times , where each machine can store at most `n^(1 - epsilon)` of the input. They get the following interesting result:
Any CREW PRAM (concurrent read exclusive write parallel random access machine) algorithm using `O(n^(2 - 2 epsilon))` total memory, `O(n^(2 - 2 epsilon))` processors, and `t = t(n)` time can be run in `O(t)` rounds of map reduce jobs like we have described.

In-Class Exercise

Suppose the web consisted of four web pages, 1, 2, 3, and 4. Page 1 and 2 each linked to each other, Page 2 is also linked to Page 3, and Page 4 is by itself. There are no other links. Write down the normalized adjacency matrix A. Write down the dangling node correction matrix D. Write down the teleporter matrix H.

Start with the vector `\vec r = (1/4, 1/4, 1/4, 1/4)^T`. Then one round of the page rank algorithm corresponds to computing:
`((1 - \alpha)(A+D)+ \alpha H)\cdot \vec r = (1-\alpha) A\vec r + (1 -alpha)D\vec r + \alpha H\vec r`.
Let `\alpha = 0.2`. Compute the three summands of the right hand side above separately showing work. For each summand, observe how hard it is to compute an entry (as function of the dimension of `r`). When done, add the summands. Post your work to the Dec 5 Discussion Thread.

What do some of the map reduce jobs for page rank look like?

The following example of computing `A \cdot \mbox{current}_r` for the normalized adjacency matrix is from Lin and Dyer (2010):
In the above, the mapper gets as arguments, the id of a webpage and a node object with info about webpage.
It takes the node's current page rank, divides it by the number of outgoing links from the node to compute `p`.
It then outputs two kinds of pairs: The first kind, (nid, node), passes along the info id and the object it corresponds to the reducer; the second kind of pair, (mid, p), consists of the id of a page linked to by node and the proportion of the page rank it will receive from node.
The reducer then gets as argument a node id on which grouping was done and a list containing floats and a single node object. The reducer than adds the floats to get an interim page rank value, and finds the single node object. It sets the page_rank of the node object to the sum it computed, and outputs (nid, node).
Notice the adjacency matrix is implicitly stored in the adjacency lists of each node.
My hope is this gives a little bit of the flavor of how MapReduce is used.
Thinking back to our sort based index construction methods, MapReduce can also be used to compute inverted indexes as part of indexing.

Map Reduce Using Hadoop

Hadoop is an open source reimplementation of the Google File System and Map Reduce created in 2003 by Doug Cutting and Mike Cafarella then at Yahoo.
Currently, it is an Apache project.
To give a flavor of an actually coded Map Reduce algorithm we briefly present how to write the word count example from earlier in terms of Hadoop.
You can use apt, brew, choco to get an installation of Hadoop suitabble for single machine development or testing. For example, I did
```
brew install hadoop
```
to install Hadoop on my laptop.
From the command line one runs Hadoop with lines like:
```
hadoop some-command args
```
For example, if some command is jar then it runs the map-reduce job given by the arguments.
Without any arguments hadoop gives the list of possibilities for some-command, one interesting subset of which is fs, which indicates a command for the hadoop distributed file system.
Hadoop has as one of its components YARN (Yet Another Resource Negotiator) which is used to manage and locate resources which might be distributed across several machines. YARN can be run from the command line using the command yarn.

Word Count as a Hadoop Map Reduce job

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
    /* This inner class is used for our Map job
       First two params to Mapper are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCMapper extends Mapper<Object,Text,Text,IntWritable>
    {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            // normalize document case, get rid of non word chars
            String document = value.toString().toLowerCase()
                .replaceAll("[^a-z\\s]", "");
            String[] words = document.split(" ");
            for (String word : words) {
                Text textWord = new Text(word);
                IntWritable one = new IntWritable(1);
                context.write(textWord, one);
            }
        }
    }
    /* This inner class is used for our Reducer job
       First two params to Reducer are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCReducer extends Reducer<Text, IntWritable,
        Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            IntWritable result = new IntWritable();
            for (IntWritable val: values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main (String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputDir);
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outputDir, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compiling and Executing WordCount.java on Hadoop

To compile WordCount.java we need to make sure we have all the relevant classpaths for the Hadoop classes we want.

To do this we use YARN:

javac -classpath `yarn classpath` -d . WordCount.java

We then make a jar file out of our compiled program:

jar -cf WordCount.jar WordCount.class 'WordCount$WCMapper.class' 'WordCount$WCReducer.class'

We can then run the program on hadoop:

hadoop jar WordCount.jar WordCount /Users/cpollett/tmp.txt /Users/cpollett/output

Here tmp.txt is the file we want to count the words for, /Users/cpollett/output is where we are outputting the results in HDFS format for the job.
After running the job, we could see what files were in the output folder with:
```
hadoop fs -ls /Users/cpollett/output/
```
Finally, if we want to see the counts, we could type:
```
hadoop fs -cat /Users/cpollett/output/part-r-00000
```

In my case, this outputs:

2018-12-05 11:28:01,201 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	1
foo	2
la	3

Map Reduce, Page Rank, Hadoop

Outline