Map Reduce on Hadoop - Categorization and Filtering




CS267

Chris Pollett

Dec 4, 2023

Outline

Introduction

Map Reduce Using Hadoop

Word Count as a Hadoop Map Reduce job

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
    /* This inner class is used for our Map job
       First two params to Mapper are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCMapper extends Mapper<Object,Text,Text,IntWritable>
    {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            // normalize document case, get rid of non word chars
            String document = value.toString().toLowerCase()
                .replaceAll("[^a-z\\s]", "");
            // split on runs of whitespace so blank tokens are not emitted
            String[] words = document.split("\\s+");
            for (String word : words) {
                if (word.isEmpty()) {
                    continue;
                }
                // emit (word, 1) for each occurrence of the word
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
    /* This inner class is used for our Reducer job
       First two params to Reducer are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCReducer extends Reducer<Text, IntWritable,
        Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            IntWritable result = new IntWritable();
            for (IntWritable val: values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main (String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputDir);
        // Hadoop refuses to overwrite an existing output directory,
        // so remove any output left over from a previous run
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outputDir, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compiling and Executing WordCount.java on Hadoop
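One common workflow, following the standard Apache Hadoop tutorial (the jar name and the HDFS paths here are placeholders), is to compile against the Hadoop classpath with `hadoop com.sun.tools.javac.Main WordCount.java`, package the classes with `jar cf wc.jar WordCount*.class`, and submit the job with `hadoop jar wc.jar WordCount input_dir output_dir`. After a successful run, `output_dir` typically contains files such as `part-r-00000`, with one tab-separated word/count pair per line.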

Quiz

Which of the following is true?

    1. In TrafficRank, we give web crawl seed sites an initial quantity of money, then download those pages and reassign their money to the pages they link to.
    2. Both PageRank and HITS rely on the power method to compute document scores.
    3. SALSA makes use of a teleporter matrix.

Long-Term Recurring Information Needs

Categorization and Filtering

Approaches to Categorization and Filtering

Topic-Oriented Batch Filtering

Example of Topic-Oriented Filtering

Evaluating Topic-Oriented Filtering

Issues With Evaluating the Result

Size-Specific Precision at `k`
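Both of the measures on the next slides build on the standard precision-at-`k` measure, stated here for reference: for a single topic, rank the returned documents by score and compute

\[ P@k \;=\; \frac{\big|\{\text{relevant documents among the top } k \text{ returned}\}\big|}{k}. \]

Roughly speaking, the size-specific and aggregate variants then differ in how `k` is chosen and how the per-topic values are combined.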

Size-Specific Precision Results

Aggregate Precision at `k`

Aggregate Precision Results

Online Filtering

Aggregate precision score comparisons for online filtering

Historical Collection Statistics

Online Filtering with Historical Collection Statistics -- What to Store

Online Filtering Using Historical Info

Historical Training Examples

Language Categorization - Filtering

Language filtering results

Language Categorization - Categorization

Language categorization results

On-line Adaptive Spam Filtering

Spam filtering process
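To make the process concrete, below is a minimal self-contained sketch of a generic on-line adaptive loop. It is not the method on these slides: the naive per-word log-ratio score is only a stand-in for a real scorer such as the BM25-based one discussed later, and all names are illustrative. The essential shape is: score each arriving message, compare the score to a threshold, then fold the user's judgment back into the statistics before the next message arrives.

import java.util.HashMap;
import java.util.Map;

/* Minimal sketch of an on-line adaptive spam filter (illustrative only). */
public class OnlineFilterSketch
{
    // word -> {spamCount, hamCount}
    private final Map<String, int[]> counts = new HashMap<>();

    // Score a message: positive means more spam-like than ham-like evidence.
    double scoreMessage(String message)
    {
        double score = 0;
        for (String word : message.toLowerCase().split("\\s+")) {
            int[] c = counts.getOrDefault(word, new int[]{0, 0});
            score += Math.log((c[0] + 1.0) / (c[1] + 1.0));
        }
        return score;
    }

    // Adapt: fold the user's judgment into the per-word counts.
    void updateModel(String message, boolean isSpam)
    {
        for (String word : message.toLowerCase().split("\\s+")) {
            int[] c = counts.computeIfAbsent(word, w -> new int[]{0, 0});
            c[isSpam ? 0 : 1]++;
        }
    }

    public static void main(String[] args)
    {
        OnlineFilterSketch filter = new OnlineFilterSketch();
        double threshold = 0.5;
        String[] messages = { "cheap pills now", "meeting notes attached",
            "cheap meeting pills" };
        boolean[] feedback = { true, false, true }; // user judgments
        for (int i = 0; i < messages.length; i++) {
            boolean flagged = filter.scoreMessage(messages[i]) >= threshold;
            System.out.println((flagged ? "SPAM: " : "HAM:  ") + messages[i]);
            filter.updateModel(messages[i], feedback[i]); // on-line adaptation
        }
    }
}

By the third message, the spam evidence accumulated from the first judgment is enough to push "cheap ... pills" over the threshold, which is the sense in which the filter adapts on-line.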

Framework for Categorization

BM25 Applied to On-line Adaptive Email Filtering

BM25 Applied to On-line Adaptive Email Filtering (cont'd)
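For reference, one standard presentation of the BM25 formula being adapted here, with the usual free parameters \(k_1\) and \(b\) (the email-specific choices of query terms and collection statistics are what these slides describe):

\[ \mathrm{score}(d, q) \;=\; \sum_{t \in q} \log\!\frac{N}{N_t} \cdot \frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\!\left(1 - b + b\,\frac{l_d}{l_{\mathrm{avg}}}\right)} \]

where \(N\) is the number of documents in the collection (in the on-line setting, the messages seen so far), \(N_t\) the number containing term \(t\), \(f_{t,d}\) the frequency of \(t\) in document \(d\), \(l_d\) the length of \(d\), and \(l_{\mathrm{avg}}\) the average document length.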

Evaluation

Email Filter Evaluation 1

Email Filter Evaluation 2

Choosing a threshold