Introduction

Last week we were talking about document quality measures.
We said that one of the earliest, and since it was used by Google, most famous such measures was page rank.
We showed that the amount of computation needed to calculate page rank for an index of 10 billion pages was not feasible unless done on many machines.
We then showed how it might be done on several machines using the map reduce framework we had introduced earlier.
So far, we have not discussed any real world implementation of map reduce.
We start today, by giving a very brief introduction to Hadoop.

Map Reduce Using Hadoop

Hadoop is an open source re-implementation of the Google File System and Map Reduce created in 2003 by Doug Cutting and Mike Cafarella then at Yahoo.
Currently, it is an Apache project.
To give a flavor of an actually coded Map Reduce algorithm we briefly present how to write the word count example from earlier in terms of Hadoop.
You can use apt, brew, choco to get an installation of Hadoop suitable for single machine development or testing. For example, I did
```
brew install hadoop
```
to install Hadoop on my laptop.
From the command line one runs Hadoop with lines like:
```
hadoop some-command args
```
For example, if some command is jar then it runs the map-reduce job given by the arguments.
Without any arguments hadoop gives the list of possibilities for some-command, one interesting subset of which is fs, which indicates a command for the hadoop distributed file system.
Hadoop has as one of its components YARN (Yet Another Resource Negotiator) which is used to manage and locate resources which might be distributed across several machines. YARN can be run from the command line using the command yarn.

Word Count as a Hadoop Map Reduce job

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
    /* This inner class is used for our Map job
       First two params to Mapper are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCMapper extends Mapper<Object,Text,Text,IntWritable>
    {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            // normalize document case, get rid of non word chars
            String document = value.toString().toLowerCase()
                .replaceAll("[^a-z\\s]", "");
            String[] words = document.split(" ");
            for (String word : words) {
                Text textWord = new Text(word);
                IntWritable one = new IntWritable(1);
                context.write(textWord, one);
            }
        }
    }
    /* This inner class is used for our Reducer job
       First two params to Reducer are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCReducer extends Reducer<Text, IntWritable,
        Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            IntWritable result = new IntWritable();
            for (IntWritable val: values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main (String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputDir);
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outputDir, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compiling and Executing WordCount.java on Hadoop

To compile WordCount.java we need to make sure we have all the relevant classpaths for the Hadoop classes we want.

To do this we use YARN:

javac -classpath `yarn classpath` -d . WordCount.java

We then make a jar file out of our compiled program:

jar -cf WordCount.jar WordCount.class 'WordCount$WCMapper.class' 'WordCount$WCReducer.class'

We can then run the program on hadoop:

hadoop jar WordCount.jar WordCount /Users/cpollett/tmp.txt /Users/cpollett/output

Here tmp.txt is the file we want to count the words for, /Users/cpollett/output is where we are outputting the results in HDFS format for the job.
After running the job, we could see what files were in the output folder with:
```
hadoop fs -ls /Users/cpollett/output/
```
Finally, if we want to see the counts, we could type:
```
hadoop fs -cat /Users/cpollett/output/part-r-00000
```

In my case, this outputs:

2018-12-05 11:28:01,201 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	1
foo	2
la	3

Quiz

Which of the following is true?

HITS was originally formulated as a query dependent document quality measure.
Traffic rank makes use of the power method for computing eigenvectors.
Page rank can be implemented as a one round map reduce job.

Long-Term Recurring Information Needs

Our next topic involves information needs that are longstanding, recurring, or common for a large population of users.
For example, given a set of text documents determine the language that each is written in?
Another example, given a set of documents, determine which ones are spam?
The information need in both cases is commonly understood and likely to a rise in many circumstances.
For example, a search engine might need to identify the language to determine a relevant response, a customer support help desk, might need to identify the language of a email to route it to someone capable of reading it.
A spam filter might prevent a user wasting unnecessary time on emails which are not useful. It might also prevent the indexing of web content which prevents the return of more relevant documents.
As yet another example, in handling medical data, we might need to be able to automatically filter current medical information from historical information.

Categorization and Filtering

To be more precise on the problems we'd like to solve we introduce the following terminology:
Categorization is the process of labeling documents to satisfy some information need. For example, label documents ham (relevant) or spam (not relevant).
Filtering is the process of evaluating on an ongoing basis according to satisfy some information need. For example, routing emails according to language or scanning medical documents and routing those related to mental health to mental health care professionals in a newsletter.
The term routing is often used synonymously with filtering but with the added connotation that the documents are delivered to distinct locations.
Another term for filtering is selective dissemination of information (SDI).
In some sense both classes can be thought of as duals to traditional search: In search, we want to find the documents satisfying the information need; whereas, in categorization and filtering, we want to find the information need satisfied by the document.

Approaches to Categorization and Filtering

Categorization and filtering tasks are often specified differently then in a search setting. For example, one does not generally search just for "spam documents" or "Swahili documents" in the search setting.
Also, the longstanding nature of the task (for example, deliver medical documents meeting some criteria) often makes it worthwhile to train the system in more detail to handle the specific need.
One approach (expert systems) to doing this is have an expert manually specify rules in some restricted languages for each category.
Only documents meeting the conditions of each of the rules for the category would belong to the category.
We will focus on automated approaches to categorization based on machine learning. These produce programs called classifiers to perform the categorization task.
Just as a search engine can return either a fixed set of documents (Boolean retrieval) or a ranked list, classifiers may be either hard (belongs to category or not) or soft (confidence score).
Soft classification is often called ranking, and the study of soft classifiers is called learning to rank.
The ranking functions we discussed earlier, such as BM25 or DFR, can be considered as a kind of soft classifier.
We now look at several common filtering and categorization problems, look at ways to solve, and look at ways to evaluate our solutions...

Topic-Oriented Batch Filtering

Consider the health-care task we presented earlier.
At the start, there are no documents. As new documents arrive, we could imagine filtering them.
If we can afford to wait, we can accumulate documents into a corpus, index them, apply a search technique like BM25 to rank them according to the health categories desired, and send them to the appropriate users.
We can then wait for more documents to arrive and repeat the process.
This approach is called batch filtering.
This approach is viable if enough documents arrive within the repetition window, say once a week, once a month, etc.

Example of Topic Oriented Filtering

The above table shows the results of doing batch filtering for our medical task (Topic 283) on the TREC 45 corpus using BM25.
The query used to rank documents for relevance was `langle` "mental", "illness", "drugs" `rangle`.
`N=140,651` Financial Times articles from 1993 onward in TREC45 were used. Of these 67 were relevant to Topic 283.
Four character char grams were used for tokens. (According to the book, stemming and not stemming approaches yielded similar results).
Documents were grouped into batches of fixed size, depending on the experiment. For example for the1000 batch experiment, the batch size consisted of 141 documents.
Each batch was indexed independently of the rest of the collection. Each batch is presented to the BM25 filter one at a time in chronological order.
To evaluate the results of more than one batch, the table above shows the simplistic approach of evaluating each performance measure separately for each batch, then average these results to compute a summary measure.

Issues With Evaluating the Result

Some problems with evaluating as we did on the last slide are:
- When we divide into more batches, the number of documents/batch goes down, and so the likelihood a batch has no relevant documents increases. (`|Rel| = 0`). `AP` (average precision) will be undefined if any batch has `|Rel|=0`. So any measure that depends on `|Rel|` will not work for small batches.
- Measures like precision at `k` are heavily influenced by `n`, making it impossible to compare results from different batch sizes.
We can see these effect in our tables from the last slide -- the `AP` values for the Second Bacth as well as the average are undefined for greater than 10 batches. Both the `AP` and `P`@`10` results also don't seem to be very stable with respect to batch size.

Size-Specific Precision at `k`

One way to solve the issue with `P`@`k` of the last slides is to compute size specific `P`@`k`. I.e., `k` used for evaluating batches sizes of size `n` is set to `\lfloor rho n\rfloor` where `rho` is some constant.
Typically, `rho` is chosen to be `k/N` where `N` is the corpus size and `k` is roughly the total number of documents presented to the user.
The table above show using this approaches gives more stable results across batch sizes.

Aggregate Precision at `k`

One drawback with size specific precision at `P`@`k`, is that the number of relevant documents can vary between batches, but we are always using the same value for `k`.
Instead, of fixing `k` across all of our batches of a given size, we can use the fact that we have a scoring function `s` (BM25 in our example) and in a given batch present to the user all documents which have a relevance score above some threshold `t`.
We choose `t` so that `k=rho N` documents will have score `s >t`.
We define aggregate `P`@`k` and `AP` using this thresholding idea.
The table above shows this approach is even more stable than Size-Specific Precision and AP.

Hadoop, Categorization and Filtering

Outline

Introduction

Map Reduce Using Hadoop

Word Count as a Hadoop Map Reduce job

Compiling and Executing WordCount.java on Hadoop

Quiz

Long-Term Recurring Information Needs

Categorization and Filtering

Approaches to Categorization and Filtering

Topic-Oriented Batch Filtering

Example of Topic Oriented Filtering

Issues With Evaluating the Result

Size-Specific Precision at `k`

Aggregate Precision at `k`