More Parallel Information Retrieval




CS267

Chris Pollett

Nov 13, 2019

Outline

Introduction

Document Partitioning

How many documents should the servers return?

How many docs to return? (cont'd)

Graph of k versus m

In-Class Exercise

Suppose we have 24 machines and want to return the top 20 documents with probability at least 0.999. How many of its best results should each machine compute? You can use the chart to estimate this even though it has no 24-machine case.

Post your estimate and how you obtained it to the Nov 13 In-Class Exercise Thread.
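
One way to check a chart-based estimate is to compute the probability directly. The sketch below is not taken from the slides. It assumes the documents are assigned to machines uniformly at random and independently, so the global top m documents are all returned exactly when no machine holds more than k of them. The function names prob_top_m_covered and smallest_k are hypothetical, and the model behind the chart may differ from this one.

```python
from fractions import Fraction
from math import factorial


def prob_top_m_covered(n, m, k):
    """Probability that no machine holds more than k of the top-m documents.

    Assumption (not from the slides): each of the m documents lands on one
    of n machines uniformly at random, independently of the others.
    Uses the identity P = m! / n^m * [x^m] (sum_{j=0..k} x^j / j!)^n.
    """
    # Coefficients of the truncated exponential series 1 + x + x^2/2! + ... + x^k/k!
    base = [Fraction(1, factorial(j)) for j in range(k + 1)]
    # poly holds coefficients of x^0..x^m; start with the constant polynomial 1.
    poly = [Fraction(1)] + [Fraction(0)] * m
    for _ in range(n):
        nxt = [Fraction(0)] * (m + 1)
        for i, a in enumerate(poly):
            if a == 0:
                continue
            for j, b in enumerate(base):
                if i + j > m:
                    break  # terms above x^m are never needed
                nxt[i + j] += a * b
        poly = nxt
    return poly[m] * factorial(m) / Fraction(n) ** m


def smallest_k(n, m, target):
    """Smallest per-machine result count k whose success probability meets target."""
    for k in range(1, m + 1):
        if prob_top_m_covered(n, m, k) >= target:
            return k
    return m  # k = m always succeeds


if __name__ == "__main__":
    n, m, target = 24, 20, Fraction(999, 1000)
    k = smallest_k(n, m, target)
    print(k, float(prob_top_m_covered(n, m, k)))
```

Exact fractions avoid rounding trouble near the 0.999 threshold. For much larger n or m, a floating-point version or a simulation would likely be faster.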

Bottlenecks of Document Partitioning

Query Processing with Term Partitioning

Drawbacks of Term Partitioning / Hybrid Approach

MapReduce

The Basic Framework

Distinct Phases of a MapReduce Job

Example MapReduce Job for Counting

Example MapReduce Job for Counting (cont'd)

Parallelizing MapReduce

Combiners

Fault Tolerance