More Parallel Information Retrieval




CS267

Chris Pollett

Nov 28, 2012

Outline

Introduction

Document Partitioning

How many documents should the servers return?

How many docs to return? (cont'd)

Graph of k versus m

HW Problem

Exercise 9.2. Show that models resulting from Dirichlet smoothing can be treated as probability distributions. That is, show that `sum_(t in V)M_d^(\mu)(t) = 1`.

Answer.

`sum_(t in V)M_d^(\mu)(t) = sum_(t in V) frac(f_(t,d) + mu M_C(t))(l_d + mu)`
`= sum_(t in V) frac(f_(t,d))(l_d + mu) + sum_(t in V) frac(mu M_C(t))(l_d + mu)`

Summing `f_(t,d)` over all terms `t in V` gives `l_d` by definition, since `l_d` is the length of document `d`. Also, `M_C` is a probability distribution on terms, so summing `M_C(t)` over all terms gives `1`. Because `l_d + mu` does not depend on `t`, it can be factored out of both sums. Hence the left-hand sum above evaluates to `frac(l_d)(l_d + mu)`
and the right-hand sum above evaluates to
`frac(mu)(l_d + mu)`
Adding these two fractions gives `frac(l_d + mu)(l_d + mu) = 1`, the desired value.
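
As a sanity check on the argument above, here is a minimal Python sketch that builds a Dirichlet-smoothed model for a toy collection and sums it over the vocabulary. The function name `dirichlet_model`, the two-document collection, and the choice `mu = 1000` are illustrative assumptions, not part of the exercise. It also assumes `M_C(t)` is the maximum-likelihood collection model (collection count of `t` divided by total collection length) and that `d` belongs to the collection, so every term of `d` is in `V`. Exact fractions are used so the sum is not blurred by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction


def dirichlet_model(doc, collection, mu):
    """Return {t: M_d^mu(t)} for every term t in the vocabulary V.

    Assumes doc is one of the documents in collection, so V (the set of
    terms occurring in the collection) covers every term in doc.
    """
    coll_counts = Counter(t for d in collection for t in d)
    coll_len = sum(coll_counts.values())
    doc_counts = Counter(doc)  # f_(t,d); terms absent from doc count as 0
    l_d = len(doc)             # document length l_d
    model = {}
    for t, c in coll_counts.items():
        m_c = Fraction(c, coll_len)  # M_C(t), assumed maximum likelihood
        model[t] = (doc_counts[t] + mu * m_c) / (l_d + mu)
    return model


# Hypothetical toy collection, used only for illustration.
collection = [
    "to be or not to be".split(),
    "that is the question".split(),
]
model = dirichlet_model(collection[0], collection, mu=1000)
total = sum(model.values())
print(total)  # prints 1
```

The same check passes for any positive `mu` and any document in the collection, which mirrors the proof: the `f_(t,d)` terms contribute `l_d`, the `mu M_C(t)` terms contribute `mu`, and the common denominator is `l_d + mu`.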

Term Partitioning

Query Processing with Term Partitioning

Drawbacks of Term Partitioning/ Hybrid approach

MapReduce

The Basic Framework

Distinct Phases of a MapReduce Job

Example MapReduce Job for Counting

Example map reduce job for counting

Parallelizing MapReduce

Combiners

Fault Tolerance