More Algorithms -- HITS

Ask.com (Teoma) used to use another kind of algorithm called HITs (Kleinberg 1998).
This works by iteratively computing two scores for a page: (a) an authority score measuring the sum of hubs linking to it and (b) a hub score measuring how good the authorities linked from this site are.
In its original formulation, it was query dependent in its calculation of (a) and (b) -- we would only consider links on the key words.
To be more precise, in HITs we initialize two scores for each page: `x_i^((0))`, the authority score, and `y_i^((0))`, the hub score to one over the number of pages.
We then compute `x_i^((k)) = sum_(j:e_(ji) in E)y_j^((k-1))` and `y_i^((k)) = sum_(j:e_(ij) in E)x_j^((k))`
Notice `x_i^((k))` is the sum over the links into `i` of the hub scores of the previous round and `y_i^((k))` is the sum over the links out of `i` of the authority scores.
We iterate through `k` until `x_i^((k))` and `x_i^((k+1))` as well as `y_i^((k))` and `y_i^((k+1))` are less than some `epsilon`.
If `L` is the adjacency matrix then one can see `x^((k)) = L^TL x^((k-1))` and `y^((k)) = LL^T y^((k-1))`.
Convergence of the iterations comes from the fact that `L^TL` and `LL^T` are symmetric positive semidefinite matrices.
Convergence of HITS is typically faster than page rank requiring on the order of 10-15 iterations; however, it might suffer from uniqueness issues.
Although in its original formulation it was query dependent, in practice, it was run in a query independent fashion.

SALSA

SALSA (Stochastic Approach to Link Structure Analysis) was proposed by Lempal and Moran (2000).
If we look at the matrix `L` from the HITs slide you notice that we never normalized the sum of columns or rows to make then sum to 1.
On the other hand, when computing Page Rank we explicitly made sure the rows summed to 1 and so we had a stochastic matrix.
The starting point of SALSA is to make two matrices `L_r` and `L_c` from `L`. In the first, we normalize the rows of `L` so they sum to 1. i.e., if a row has five `1` entries, we divide make each entry `1/5`. In the second, we normalize the columns to `1`.
We define a hub matrix `H` to be `L_rL_c^T` and we define an authority matrix `A` to be `L_c^TL_r`.
We then use these two matrices when iterating and computing the authority and hub vectors.
SALSA as an algorithm seems to be more immune to topic drift -- where a high ranking but off-topic page creeps up in the results than HITS.
Like HITs the vector produced by their algorithm might not be unique (i.e., if start with different length one initialization vectors get different answers).

In-Class Exercise

Consider the graph from Wikipedia below:
Initially, `x_i^((0))` and `y_j^((0))` as per the HITs slide, then compute `L_r`, `L_c`, `H`, `A`, `x_i^((1))`, and `y_j^((1))` according to the SALSA algorithm.
Please post your solutions to the Nov 29 In-Class Exercise Thread.

Recalling Map Reduce

So far we talked about map reduce as a model of parallel computation.
As described a map reduce job had three phases: (a) a map phase, where key/value pairs are read from the input and the map function is applied to each of them individually, (b) a shuffle phase, where the pairs produced during the map phase are sorted by their key, and all values for the same key are grouped together, (c) a reduce phase where the reduce function is applied to each key and its values.
We have machines (nodes) which act as mappers. They apply the map function to a subset of the key value pairs.
They then send the results to different reducer machines based on a hash of the key value for a key-value output by the map phase.
Before the reducer begins reducing the data from all its inputs, the data is typically stored in a distributed file system, in case the reducer fails -- that way we don't have to recompute the map phase it depended on.
Often mappers and reducers are replicated three ways to keep things operating smoothly.

Recalling Page Rank

These other conditions guaranteed that if we started with some guess `vec{r'}` for `vec(r)` and we computed powers `\vec{A}^{(n)}\vec{r'}` until
`||\vec{A}^{(n+1)}\vec{r'} - \vec{A}^{(n)}\vec{r'}|| < \epsilon`
then `\vec{A}^{(n)}\vec{r'} approx \vec{r}`.
Why is computing these powers hard?
Why would we even want to try to use Map Reduce to compute this?
Powering seems to involve some kind of iteration. So we at least need to extend our basic map reduce to handle iterating the map and reduce operations.

The Scary Numbers Behind the Google Matrix

At first blush because of the teleportation correction every single entry in the Google matrix is nonzero.
How many entries are there? Well the number of rows is the number of web pages, and the number of columns is the number of web pages.
So conservatively it is a `10^(10) times 10^(10)` matrix. i.e, it has `10^(20)` entries.
Let `n` be the rank of the matrix (in this case `10^(10)`). Multiplying the matrix against `vec{r'}` takes time `O(n^2)` using the usual way to do matrix multiplication.
So 50 iterations would take time `O(50n^2)` or order `5 times 10^(21)` operations!
Currently, a typical processor core does less than `10^(10)` ops per second. So it would take `5 times 10^(11)` seconds or about `15855` years to do on a single core.
There are fast matrix multiply techniques based on techniques originally due to Strassen (1969), but these tend to be hard to use in practice -- and it wouldn't save enough to be feasible on one machine.
The key thing that makes computing `vec(r)` feasible is that the Google matrix is essentially, a normalized adjacency matrix (which only has a small number of nonzero entries), together with corrections which can be computed in linear time in the rank.
So multiplying against `vec(r')` can be done in time proportional to the square of the nonzero entries in the adjacency matrix.
If we assume any given page links to on average `50` or so other pages, this reduces the time to multiply a single matrix row to something like `O(50n)` and the total time to something like `O(2500n)` -- still huge if `n = 10^10`, but doable if we can split the job among several machines.

High-Level Parallel Algorithm for Page Rank

Let epsilon be the constant we used to decide if we stop
Compute initial list of node objects, each with a page_rank field and an adjacency list.
   This whole list we'll call current_r and slightly abuse notation to view it as a 
    vector to which our matrices are applied
do {
    Store in distributed file system (DFS) pairs (nid, node) as (old_nid, old_node)
        where node is a node object (containing info about a web page)
    Do map reduce job to compute A*current_r 
        where A is the normalized adjacency matrix
    Store result in DFS as pairs (nid, node) where node has its page_rank field set to the
        value given by the above operation.
    Do map reduce job to compute dangling node correction to current_r
    Store result in DFS as (nid, node) where where node has its page_rank field set to the
        value given by the above operation.
    Do map reduce job to compute teleporter correction to current_r
    Store result in DFS as (nid, node) where where node has its page_rank field set to the
        value given by the above operation.
    Send all pairs (nid, node) in DFS computed above to reduce job which
        computes (nid, node) in which node has a page_rank equal to the sum of the three
        page_ranks that one would have grouping by nid.
    Store result in DFS as pairs (nid, node).
    Do map reduce job to compute len = || current_r - old_r||
} while(len > epsilon)
output nodes with their page ranks

As we can see, we iteratively apply map reduce jobs to data stored in the distributed file system
The amount of space we need to store stuff in the DFS is proportional to the nonzero entries in `A` and `vec(r)`.
Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii (2010) studied what kinds of computations can be done by iteratively applying map reduce jobs `t` times , where each machine can store at most `n^(1 - epsilon)` of the input. They get the following interesting result:
Any CREW PRAM (concurrent read exclusive write parallel random access machine) algorithm using `O(n^(2 - 2 epsilon))` total memory, `O(n^(2 - 2 epsilon))` processors, and `t = t(n)` time can be run in `O(t)` rounds of map reduce jobs like we have described.

What do some of the map reduce jobs for page rank look like?

The following example of computing `A \cdot \mbox{current}_r` for the normalized adjacency matrix is from Lin and Dyer (2010):
In the above, the mapper gets as arguments, the id of a webpage and a node object with info about webpage.
It takes the node's current page rank, divides it by the number of outgoing links from the node to compute `p`.
It then outputs two kinds of pairs: The first kind, (nid, node), passes along the info id and the object it corresponds to the reducer; the second kind of pair, (mid, p), consists of the id of a page linked to by node and the proportion of the page rank it will receive from node.
The reducer then gets as argument a node id on which grouping was done and a list containing floats and a single node object. The reducer than adds the floats to get an interim page rank value, and finds the single node object. It sets the page_rank of the node object to the sum it computed, and outputs (nid, node).
Notice the adjacency matrix is implicitly stored in the adjacency lists of each node.
My hope is this gives a little bit of the flavor of how MapReduce is used.
Thinking back to our sort based index construction methods, MapReduce can also be used to compute inverted indexes as part of indexing.

Doc Quality, Page Rank via Map Reduce, Hadoop

Outline

More Algorithms -- HITS

SALSA

In-Class Exercise

Recalling Map Reduce

Recalling Page Rank

The Scary Numbers Behind the Google Matrix

High-Level Parallel Algorithm for Page Rank

What do some of the map reduce jobs for page rank look like?