Doc Quality, Page Rank via Map Reduce, Hadoop




CS267

Chris Pollett

May 10, 2021

Outline

More Algorithms -- HITS

SALSA

Recalling Map Reduce

Recalling Page Rank

The Scary Numbers Behind the Google Matrix

High-Level Parallel Algorithm for Page Rank

Let epsilon be the constant we use to decide when to stop
Compute initial list of node objects, each with a page_rank field and an adjacency list. This whole list
   we'll call current_r and slightly abuse notation to view it as a vector to which
   our matrices are applied
do {
    Store in distributed file system (DFS) pairs (nid, node) as (old_nid, old_node)
        where node is a node object (containing info about a web page)
    Do map reduce job to compute A*current_r 
        where A is the normalized adjacency matrix
    Store result in DFS as pairs (nid, node) where node has its page_rank field set to the
        value given by the above operation.
    Do map reduce job to compute dangling node correction to current_r
    Store result in DFS as (nid, node) where node has its page_rank field set to the
        value given by the above operation.
    Do map reduce job to compute teleporter correction to current_r
    Store result in DFS as (nid, node) where node has its page_rank field set to the
        value given by the above operation.
    Send all pairs (nid, node) in the DFS computed above to a reduce job which
        computes (nid, node) in which node has a page_rank equal to the sum of the three
        page_ranks obtained by grouping on nid.
    Store result in DFS as pairs (nid, node).
    Do map reduce job to compute len = ||current_r - old_r||, where old_r is the vector of
        page_ranks stored at the start of the loop
} while(len > epsilon)
output nodes with their page ranks
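
To make the first job inside the loop concrete, here is a minimal sketch of a Hadoop implementation of the A*current_r step. It is only a sketch: it assumes each node is serialized on one line as its nid, its current page_rank, and a comma-separated list of out-link nids, separated by tabs, and the class names (PageRankMultiply, etc.) are made up for illustration rather than taken from the lecture. To keep it short, it only propagates rank mass; a full implementation would also pass each node's adjacency list through (say, by having the mapper re-emit it under the node's own nid) so the next iteration can use it.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankMultiply
{
    /* Mapper for the A*current_r step. Each input line is assumed to have
       the form: nid TAB page_rank TAB comma-separated out-link nids.
       A node with out-degree d sends page_rank/d to each of its out-links,
       which is exactly what multiplying current_r by the normalized
       adjacency matrix does.
     */
    public static class PRMultiplyMapper
        extends Mapper<Object, Text, Text, DoubleWritable>
    {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            String[] parts = value.toString().split("\t");
            if (parts.length < 3 || parts[2].isEmpty()) {
                return; // dangling node: its mass is handled by the correction job
            }
            double rank = Double.parseDouble(parts[1]);
            String[] outLinks = parts[2].split(",");
            DoubleWritable share = new DoubleWritable(rank / outLinks.length);
            for (String outNid : outLinks) {
                context.write(new Text(outNid), share);
            }
        }
    }
    /* Reducer: the entry of A*current_r for a given nid is the sum of the
       shares sent to nid by the nodes that link to it.
     */
    public static class PRMultiplyReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
    {
        public void reduce(Text key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException
        {
            double sum = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pagerankmultiply");
        job.setJarByClass(PageRankMultiply.class);
        job.setMapperClass(PRMultiplyMapper.class);
        job.setReducerClass(PRMultiplyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Since summing shares is associative, the same reducer class could also be registered as a combiner to shrink the data shuffled between machines.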

Quiz

Which of the following is true?

  1. The formula we got to determine the odds of getting the top `m` results overall by requesting the top `k` results from `n` query nodes makes use of the binomial distribution.
  2. A map reduce job must be run on at least two machines.
  3. The power method to compute eigenvectors works for any adjacency matrix.

What do some of the map reduce jobs for page rank look like?
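
One answer, continuing the illustrative assumptions from the sketch above: the job that adds up the three partial results is just a group-by-nid-and-sum reduce. The mapper can be the identity (the base Mapper class, whose default map() re-emits its input pair), assuming the three jobs' outputs are read back so that keys and values arrive as Text (for instance with KeyValueTextInputFormat); the class name PRSumReducer is again made up. Note how close its shape is to the WordCount reducer below.

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

/* Sums, for each nid, the partial page_ranks produced by the multiply,
   dangling, and teleporter jobs. Keys are nids; values are the partial
   page_ranks serialized as text.
 */
public class PRSumReducer extends Reducer<Text, Text, Text, DoubleWritable>
{
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException
    {
        double sum = 0;
        for (Text val : values) {
            sum += Double.parseDouble(val.toString());
        }
        context.write(key, new DoubleWritable(sum));
    }
}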

Map Reduce Using Hadoop

Word Count as a Hadoop Map Reduce job

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
    /* This inner class is used for our Map job
       First two params to Mapper are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCMapper extends Mapper<Object,Text,Text,IntWritable>
    {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            // normalize document case, get rid of non word chars
            String document = value.toString().toLowerCase()
                .replaceAll("[^a-z\\s]", "");
            // split on runs of whitespace so tabs, newlines, and repeated
            // spaces don't produce empty "words"
            String[] words = document.split("\\s+");
            IntWritable one = new IntWritable(1);
            for (String word : words) {
                if (word.isEmpty()) {
                    continue;
                }
                context.write(new Text(word), one);
            }
        }
    }
    /* This inner class is used for our Reducer job
       First two params to Reducer are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCReducer extends Reducer<Text, IntWritable,
        Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            IntWritable result = new IntWritable();
            for (IntWritable val: values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main (String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputDir);
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outputDir, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
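
As a quick trace of this job: if an input file contains the single line "The cat sat on the mat.", the mapper lowercases it, strips the period, and emits the pairs (the,1), (cat,1), (sat,1), (on,1), (the,1), (mat,1); after Hadoop groups and sorts by key, the reducer writes

cat     1
mat     1
on      1
sat     1
the     2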

Compiling and Executing WordCount.java on Hadoop

Long-Term Recurring Information Needs

Categorization and Filtering

Approaches to Categorization and Filtering