More Parallel Information Retrieval




CS267

Chris Pollett

Dec. 5, 2011

Outline

Introduction

Document Partitioning

How many documents should the servers return?

How many docs to return? (cont'd)

Graph of k versus m

Quiz

Which of the following is true?

  1. Jelinek-Mercer language modeling involves taking the language model for a document and then pretending to add `mu` documents to this model according to an underlying language model for the corpus.
  2. The derivation of the LMD relevance measure makes use of random walks through documents.
  3. Naively partitioning the index by term is more likely to create node hot-spots than partitioning the index by document.

Term Partitioning

Query Processing with Term Partitioning

Drawbacks of Term Partitioning / Hybrid Approach

MapReduce

The Basic Framework

Distinct Phases of a MapReduce Job

Example MapReduce Job for Counting

Example MapReduce Job for Counting (cont'd)
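
A counting job of this kind is commonly illustrated with term counting. The following is a minimal single-process sketch in Python, not a definitive implementation: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, the tokenizer is a plain whitespace split, and the shuffle step is simulated in memory rather than across machines.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map phase: emit a (term, 1) pair for every token in one document."""
    for term in text.lower().split():
        yield term, 1


def reduce_fn(term, counts):
    """Reduce phase: sum all the counts collected for one term."""
    return term, sum(counts)


def run_job(documents):
    """Simulate the map, shuffle, and reduce phases in a single process."""
    # Shuffle: group the intermediate values by key (assumed in-memory here).
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, count in map_fn(doc_id, text):
            groups[term].append(count)
    # Reduce: one call per distinct key.
    return dict(reduce_fn(term, counts) for term, counts in groups.items())


# Toy corpus, made up for illustration only.
docs = {1: "to be or not to be", 2: "to do is to be"}
counts = run_job(docs)
```

In a real deployment the map calls and the reduce calls would each run on separate machines, with the grouping step performed by the framework between them.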

Parallelizing MapReduce

Combiners

Fault Tolerance