More Parallel Information Retrieval




CS267

Chris Pollett

Nov 28, 2018

Outline

Introduction

Document Partitioning

How many documents should the servers return?

How many docs to return? (cont'd)

Graph of k versus m

In-Class Exercise

Suppose we have 12 machines and we want to return the top 50 documents with probability at least 0.999. How many of its best results should each machine compute? You can use the chart to estimate this, even though it has no twelve-machine case.

Post your estimate and how you obtained it to the Nov 28 In-Class Exercise Thread.

Bottlenecks of Document Partitioning

Query Processing with Term Partitioning

Drawbacks of Term Partitioning / Hybrid Approach

MapReduce

The Basic Framework

Distinct Phases of a MapReduce Job

Example MapReduce Job for Counting

Example MapReduce Job for Counting (cont'd)
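
As a rough illustration of what a counting job can look like, here is a minimal single-process sketch, assuming the job counts term occurrences across a small set of documents. The names map_fn, reduce_fn, and run_job are hypothetical and are not taken from the lecture or from any MapReduce library; run_job only imitates the map, shuffle, and reduce phases in memory and does no real distribution.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map phase (hypothetical): emit (term, 1) for every term occurrence."""
    for term in text.lower().split():
        yield term, 1

def reduce_fn(term, counts):
    """Reduce phase (hypothetical): sum all counts emitted for one term."""
    yield term, sum(counts)

def run_job(records, mapper, reducer):
    """Imitate the map, shuffle, and reduce phases in a single process."""
    # Map: apply the mapper to each (key, value) input record.
    # Shuffle: group the emitted values by their key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce: call the reducer once per distinct key, in sorted key order.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

# Assumed toy input: (document id, document text) pairs.
docs = [("d1", "to be or not to be"), ("d2", "to do is to be")]
counts = run_job(docs, map_fn, reduce_fn)
```

In an actual deployment the map calls and the reduce calls would each be spread over many machines, and the grouping step would move data across the network; the sketch keeps everything in one dictionary only to make the data flow visible.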

Parallelizing MapReduce

Combiners

Fault Tolerance