Introduction

Last Wednesday, we began looking at parallel information retrieval.
We were interested in techniques for distributing query processing and indexing across several machines (nodes).
We began by looking at query processing. We studied inter-query (replication-based) and intra-query (partitioning-based) methods for handling queries in parallel.
We then focused on the latter class of methods and had just introduced the notions of document and term partitionings of the index.
We now look at these in more detail.

Document Partitioning

In a document-partitioned index, each index server is responsible for a subset of the documents.
There is a main server (the book calls it a receptionist) that, when presented with a query, forwards it to each partition server. These `n` servers then compute the top `k` query results on their partitions and forward their answers back to the receptionist, which merges the results, selecting the top `m`.
If the search engine maintains a dynamic index that allows updates, it may even be possible to carry out the updates in a distributed fashion in which each node takes care of the updates that pertain to its part of the overall index.
When assigning documents to nodes, one typically uses some kind of hash of the document or of the document's URL to determine which server is responsible for the document's contents.
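As a rough illustration of the two steps above, here is a minimal Python sketch of hash-based document assignment and of the receptionist's merge. The function names `node_for` and `merge_top_m`, the choice of MD5 as the hash, and the `(score, doc_id)` result format are assumptions made for this example, not something prescribed by the book.

```python
import hashlib
import heapq
from itertools import chain


def node_for(url: str, n: int) -> int:
    """Map a document URL to one of n index nodes via a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest from hashlib is used instead (an illustrative choice).
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


def merge_top_m(partial_results, m):
    """Receptionist step: merge the per-node top-k lists, keep the best m.

    partial_results has one entry per node; each entry is a list of
    (score, doc_id) pairs, so tuples compare by score first.
    """
    return heapq.nlargest(m, chain.from_iterable(partial_results))


# Two nodes, each returning its local top k = 2 results.
partials = [[(0.9, "a"), (0.4, "b")], [(0.8, "c"), (0.7, "d")]]
top3 = merge_top_m(partials, 3)
```

Note that two of the overall top three results here come from the same node, which is why the choice of `k` relative to `m` matters.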

How many documents should the servers return?

Given `n`, `k`, and `m` as on the last slide, one important thing to determine is: how many documents (`k`) does each of the `n` servers need to send to the receptionist to guarantee that what the receptionist receives contains the top `m` documents in the index?
Clarke and Terra (2004) analysed this problem based on the assumption that each document was assigned to a random index node when the collection was split into `n` subsets.
So each node is equally likely to return the best, second-best, third-best, etc. result.
Let `R_m = {r_1, r_2, ... r_m}` be the set consisting of the top `m` search results. For each `r_i` the probability that it is found by a particular node is `1/n`.
Hence, the probability that exactly `l` of the top `m` results are found by that node is given by the binomial distribution:
`b(n,m, l) = ((m),(l)) cdot (1/n)^l cdot(1 - 1/n)^(m-l)`

How many docs to return? (cont'd)

The probability that all members of `R_m` are covered by requesting the top `k` results from each of the `n` index nodes can be calculated according to the following recursive formula:
`p(n,m,k) = { (1, if m le k), (0, if m > k text( and ) n = 1), (sum_(l=0)^k b(n,m,l) cdot p(n-1, m-l, k), if m > k text( and ) n > 1) :}`
This doesn't seem to have an easy closed form. Nevertheless, for fixed `n`, the values of `k` versus `m` can be computed numerically and plotted.
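The recursion is easy to evaluate numerically. The following is a minimal Python sketch, assuming the definitions of `b(n,m,l)` and `p(n,m,k)` given above; the helper `smallest_k` and its `target` parameter are hypothetical additions for this example that search for the smallest `k` reaching a desired coverage probability.

```python
from functools import lru_cache
from math import comb


def b(n: int, m: int, l: int) -> float:
    """Probability that exactly l of the top m results sit on one given node."""
    return comb(m, l) * (1 / n) ** l * (1 - 1 / n) ** (m - l)


@lru_cache(maxsize=None)
def p(n: int, m: int, k: int) -> float:
    """Probability that top-k lists from n nodes cover all of the top m."""
    if m <= k:
        return 1.0
    if n == 1:
        return 0.0  # a single node would need to return more than k results
    return sum(b(n, m, l) * p(n - 1, m - l, k) for l in range(k + 1))


def smallest_k(n: int, m: int, target: float) -> int:
    """Smallest k with p(n, m, k) >= target (hypothetical helper)."""
    for k in range(1, m + 1):
        if p(n, m, k) >= target:
            return k
    return m  # not reached for target <= 1, since p(n, m, m) == 1


# Two nodes, top two results, one result per node: the two results
# must land on different nodes, which happens with probability 1/2.
example = p(2, 2, 1)
```

Calling `smallest_k(n, m, target)` over a range of `m` for fixed `n` yields the kind of `k` versus `m` curve described above.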

Quiz

Which of the following is true?

Jelinek-Mercer language modeling involves taking the language model for a document and then pretending to add `mu` documents to this model according to an underlying language model for the corpus.
The derivation of the LMD relevance measure makes use of random walks through documents.
Naively partitioning the index by term is more likely to create node hot-spots than partitioning the index by document.

Term Partitioning

Document partitioning works best when the index data on the individual nodes can be stored in main memory or on SSD.
To see what happens when data is stored on disk, suppose queries were on average 3 words and we want the search engine to handle 100 queries per second.
Due to queueing effects, we typically cannot achieve a utilization of more than 50% without experiencing latency jumps.
So a query load of 100 qps translates to a required service rate of at least 200 qps. For 3-word queries, this translates to at least 600 random access operations per second.
Assuming an average disk latency of 10 ms, a single hard disk drive cannot perform more than 100 random access operations per second, one sixth of what we need on each of our nodes.
Adding more machines doesn't reduce the minimum that each machine must do, since every node handles every query, so this is a bottleneck for the document partitioning approach.
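The arithmetic behind this bottleneck can be spelled out in a few lines of Python. The variable names are made up for this sketch, and it assumes, as the text does, one random disk access per query term.

```python
query_load_qps = 100      # queries per second we want to handle
max_utilization = 0.5     # utilization ceiling before latency jumps
terms_per_query = 3       # average query length in words
disk_latency_s = 0.010    # average latency of one random access (10 ms)

# Capacity each node must be provisioned for to stay under 50% utilization.
required_service_rate = query_load_qps / max_utilization

# Assumption: one random access per query term.
required_accesses_per_s = required_service_rate * terms_per_query

# What a single hard disk drive can deliver.
disk_accesses_per_s = 1 / disk_latency_s

# How far one disk falls short of the requirement.
shortfall_factor = required_accesses_per_s / disk_accesses_per_s
```

Here `shortfall_factor` comes out at about 6, matching the "one sixth" figure above.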
Term partitioning addresses this problem by splitting the collection into sets of terms and assigning nodes to each of these sets.
This resolves the problem above because each node won't be responsible for handling every query.

Query Processing with Term Partitioning

Suppose a query `q` contains terms `t_1, ..., t_q`. Then the receptionist will forward the query to the node `v(t_1)` responsible for the term `t_1`.
After creating a set of document score accumulators from `t_1`'s postings list, `v(t_1)` forwards the query, along with the accumulator set, to the node `v(t_2)`, which continues the process, and so on.
Finally, `v(t_q)` sends the final accumulator set to the receptionist, where the top `m` results are selected.
We can process infrequent terms first, and do accumulator pruning in a similar fashion to the single-machine set-up.
Notice that, as each query is processed by only a single node at any given point in time, this scheme as described above does not use intra-query parallelism, so it will not reduce the response time.
This problem can be mitigated to some degree by the receptionist sending pre-fetch instructions to the later `v(t_i)`'s while `v(t_1)` does its calculations.
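The pipeline described in this section can be simulated in a single process. The sketch below is only an illustration under several assumptions: the names `TermNode`, `node_of`, `build_cluster`, and `run_query` are hypothetical, each posting already carries a precomputed score contribution, the receptionist knows document frequencies, and accumulator pruning and pre-fetching are left out.

```python
import heapq
import zlib


def node_of(term: str, num_nodes: int) -> int:
    """v(t): the node responsible for a term (a CRC32 hash, as an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_nodes


class TermNode:
    """Holds the postings lists for the terms assigned to one node."""

    def __init__(self):
        self.postings = {}  # term -> {doc_id: score contribution}

    def add(self, term, doc_id, score):
        self.postings.setdefault(term, {})[doc_id] = score

    def process(self, term, accumulators):
        """Fold this term's postings into the accumulator set."""
        for doc_id, score in self.postings.get(term, {}).items():
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + score
        return accumulators


def build_cluster(index, num_nodes):
    nodes = [TermNode() for _ in range(num_nodes)]
    for term, plist in index.items():
        for doc_id, score in plist.items():
            nodes[node_of(term, num_nodes)].add(term, doc_id, score)
    return nodes


def run_query(nodes, terms, doc_freq, m):
    """Receptionist: pass accumulators node to node, infrequent terms first."""
    ordered = sorted(terms, key=lambda t: doc_freq.get(t, 0))
    accumulators = {}
    for term in ordered:  # one node is active at any point in time
        node = nodes[node_of(term, len(nodes))]
        accumulators = node.process(term, accumulators)
    return heapq.nlargest(m, accumulators.items(), key=lambda kv: kv[1])


index = {"parallel": {1: 1.0, 2: 0.5}, "retrieval": {2: 2.0, 3: 0.7}}
doc_freq = {term: len(plist) for term, plist in index.items()}
nodes = build_cluster(index, 2)
top = run_query(nodes, ["parallel", "retrieval"], doc_freq, 2)
```

The sequential loop in `run_query` makes the lack of intra-query parallelism visible: each step has to wait for the accumulator set produced by the previous node.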

Drawbacks of Term Partitioning/ Hybrid approach

The three main drawbacks of term partitioning are: scalability -- postings lists tend to get longer as the index grows, and thus so does the time for each pipeline step; load imbalance -- some terms tend to be queried more than others, causing some machines to be overloaded compared to others; term-at-a-time -- this method works okay for term-at-a-time relevance measures such as BM25, but not for methods that need to be computed document-at-a-time, like proximity ranking.
Two ways to get the benefits of both worlds (document and term partitioning) are to split the collection by both term and document across the servers or to use replication.
The first approach involves a fair bit of coding but can work.
Alternatively, one can just use document partitioning and replicate each document partition to reduce the number of seeks any one partition needs to handle. This also improves fault tolerance.

MapReduce

Apart from processing search queries, there are many other data-intensive tasks that need to be carried out by a search engine. For instance: building and updating the index, identifying duplicate documents, analysing the link structure of the document collection, etc.
MapReduce is a framework developed at Google for doing these massively parallel computations on very large datasets.
The original paper in which it was described is Dean and Ghemawat (2004).

The Basic Framework

MapReduce was inspired by the map and reduce functions found in functional programming languages such as Lisp.
The map function takes as its arguments a function `f` and a list of elements `l = langle l_1, ..., l_n rangle`. It returns a new list
`map(f,l) = langle f(l_1), ..., f(l_n) rangle`.
The reduce function takes a function `g` and a list of elements `l = langle l_1, ..., l_n rangle`. It returns a new element `l'` such that
`l' = text(reduce)(g,l) = g(l_1, g(l_2, g(l_3, ... )))`.
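To make the two definitions concrete, here is a small Python sketch. The names `map_list` and `reduce_right` are made up for this example. Since the formula above elides the innermost call, the sketch assumes the nesting bottoms out at `g(l_(n-1), l_n)`. Python's built-in `functools.reduce` folds from the left instead, so it is not used here.

```python
def map_list(f, l):
    """map(f, l): apply f to every element, keeping the order."""
    return [f(x) for x in l]


def reduce_right(g, l):
    """reduce(g, l) = g(l_1, g(l_2, g(l_3, ...))), folding from the right.

    Assumption: the innermost call is g(l_(n-1), l_n), and a
    single-element list reduces to that element.
    """
    if not l:
        raise ValueError("reduce_right needs a non-empty list")
    acc = l[-1]
    for x in reversed(l[:-1]):
        acc = g(x, acc)
    return acc


squares = map_list(lambda x: x * x, [1, 2, 3])
total = reduce_right(lambda a, b: a + b, squares)
```

For an associative `g` such as addition, the left and right folds agree; for something like subtraction they do not.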
From a high-level view, a MapReduce program reads a sequence of key/value pairs, performs some computations on them, and outputs another sequence of key/value pairs.
Keys and values are often strings, but may be of any data type.

Distinct Phases of a MapReduce Job

Map Phase: key/value pairs are read from the input and the map function is applied to each of them individually. The function is of the general form:
`map: langle k, v rangle |-> langle langle k_1, v_1 rangle, langle k_2, v_2 rangle, ... rangle`
Shuffle Phase: the pairs produced during the map phase are sorted by their key, and all values for the same key are grouped together.
Reduce Phase: the reduce function is applied to each key and its values.
`text(reduce): langle k, langle v_1, v_2, ... rangle rangle |-> langle k, langle v_1', v_2', ... rangle rangle`
That is, for each key the reduce function processes the list of associated values and outputs another list of values. The output values, and how many of them there are, need not be the same as for the input.

Example MapReduce Job for Counting
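A minimal single-process sketch of a word-counting job, written in Python to mirror the map, shuffle, and reduce phases. The function names and the tiny two-document input are assumptions made for this illustration; a real MapReduce framework would distribute each phase across machines.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for word in text.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: sort by key and group all values for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce: collapse the list of counts for a word into one total."""
    return (word, [sum(counts)])


def run_job(inputs):
    mapped = [pair for key, value in inputs for pair in map_fn(key, value)]
    return [reduce_fn(key, values) for key, values in shuffle(mapped)]


inputs = [("d1", "the cat saw the dog"), ("d2", "the dog ran")]
word_counts = run_job(inputs)
```

Here the reduce output for each key is a one-element list, an instance of the output list differing in length from the input list.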

Parallelizing Map Reduce

Both Map and Reduce jobs can be parallelized, that is, executed on many machines.
Map jobs are typically broken into smaller pieces called map shards, each holding between 16 and 64 MB of data.
These are treated independently.
The output of these is then broken into reduce shards and sent to reduce workers.
Assignment of a key/value pair to a given reduce shard can be done by applying a hash function to its key.
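As a small illustration of that last point, the Python sketch below routes map output to reduce shards by hashing the key. The names `reduce_shard_of` and `partition`, and the use of CRC32 as the hash, are assumptions for this example.

```python
import zlib


def reduce_shard_of(key: str, num_reduce_shards: int) -> int:
    """Pick the reduce shard for a key via a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_shards


def partition(pairs, num_reduce_shards):
    """Split map output into per-shard lists of (key, value) pairs."""
    shards = [[] for _ in range(num_reduce_shards)]
    for key, value in pairs:
        shards[reduce_shard_of(key, num_reduce_shards)].append((key, value))
    return shards


map_output = [("the", 1), ("cat", 1), ("the", 1), ("dog", 1), ("the", 1)]
shards = partition(map_output, 3)
```

Because the shard depends only on the key, every pair for a given key ends up at the same reduce worker, whichever map shard produced it.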

Combiners

In many MapReduce jobs, a single map shard may produce a large number of key/value pairs for the same key. For example, the word "the" might account for 6-7% of the output in our word counting example.
Forwarding all these tuples to the reduce worker responsible for "the" wastes network and storage resources and also causes load imbalances.
To overcome this problem, each map worker can also do the shuffle/reduce phase on its own portion and then send the result to the corresponding reduce worker.
This kind of reduce that is applied to a map shard is called a combiner.
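A minimal Python sketch of a combiner for the word-counting job. The names `map_shard` and `combine` and the sample text are made up for this illustration. Local pre-aggregation is safe here because summing counts is associative and commutative; that is not true of every reduce function.

```python
from collections import Counter


def map_shard(text):
    """Map output for one shard: a (word, 1) pair per word occurrence."""
    return [(word, 1) for word in text.lower().split()]


def combine(pairs):
    """Combiner: sum the counts per word locally, before any network transfer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


raw_pairs = map_shard("the cat and the dog and the bird")
combined_pairs = combine(raw_pairs)
```

In this toy shard the 8 raw pairs shrink to 5 combined pairs, and the three `("the", 1)` tuples become a single `("the", 3)`.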

Fault Tolerance

If a machine computing a map shard fails, its input key/value pairs can be sent to a new machine and that shard can be recomputed without having to recompute any of the other map shard jobs.
At the reduce level, a given reduce shard might depend on keys from several map jobs.
To avoid having to recompute each of these dependencies on failure, the inputs to a reduce shard are typically stored in a distributed file system, so that in the event of a failure they can be re-read from there.

More Parallel Information Retrieval
