Finish Proximity Ranking, Boolean Retrieval

CS267

Chris Pollett

Feb 24, 2021

Outline

Finish Proximity Ranking
In-Class Exercise
Boolean Retrieval

Introduction

On Monday, given a term vector `langle t_1, ..., t_n rangle`, we defined a cover for this vector to be an interval `[u,v]` in the collection that contains all the terms in the vector, such that no smaller interval `[u', v']` contained in `[u,v]` also contains all the terms in the vector.
We gave an algorithm nextCover(t[1],.., t[n], position) which determines the location of the next term vector cover after a given position.
Given a document `d` with covers `[u_1, v_1]`, `[u_2, v_2]`, ... for some term vector `langle t_1, ..., t_n rangle`, we defined its proximity score to be:
`mbox(score)(d) = sum(frac(1)(v_i - u_i + 1))`.
We begin today by looking at an algorithm to rank documents according to their proximity score.
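As a quick check of the score definition above, here is a minimal Python sketch. The function name proximity_score and the list of covers are made up for illustration; exact fractions are used only to keep the arithmetic easy to read.

```python
from fractions import Fraction

def proximity_score(covers):
    """Sum 1/(v - u + 1) over the covers [u, v] found in one document."""
    return sum(Fraction(1, v - u + 1) for u, v in covers)

# Hypothetical covers of lengths 2, 3 and 2 within a single document.
covers = [(1, 2), (2, 4), (4, 5)]
print(proximity_score(covers))  # 1/2 + 1/3 + 1/2 = 4/3
```

Shorter covers contribute more, so a document where the terms occur close together, and often, gets a higher score.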

Ranking Algorithm with Proximity Scores

rankProximity(t[1],.., t[n], k)
// t[] term vector
// k number of results to return 
{
    u := - infty;
    [u,v] := nextCover(t[1],.., t[n], u);
    d := docid(u);
    score := 0;
    j := 0;
    while( u < infty) do
        if(d < docid(u) ) then
        // if docid changes record info about last docid
            j := j + 1;
            Result[j].docid := d;
            Result[j].score := score;
            d := docid(u);
            score := 0;
        score := score + 1/(v - u + 1);
        [u, v] := nextCover(t[1],.., t[n], u);
    if(d < infty) then
        // record last score if not recorded
        j := j + 1;
        Result[j].docid := d;
        Result[j].score := score;
    sort Result[1..j] by score in decreasing order;
    return Result[1..min(j, k)];
}

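Using an analysis similar to that used for galloping search in the book, you can prove this algorithm has running time:
`O(n^2 l cdot log(L/l))`,
where `n` is the number of query terms, `l` is the length of the shortest postings list among them, and `L` is the length of the longest.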
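The rankProximity pseudocode above can be sketched in Python roughly as follows. Several things here are assumptions made for illustration and are not from the lecture: positions are stored as (docid, offset) pairs, build_index, next_pos and prev_pos are hypothetical helpers, and next_pos/prev_pos use plain binary search (bisect) rather than galloping search, so the running time bound above is not claimed for this sketch.

```python
from bisect import bisect_left, bisect_right
import math

INF = (math.inf, math.inf)        # stands in for +infty
NEG_INF = (-math.inf, -math.inf)  # stands in for -infty

def build_index(docs):
    """Map each term to a sorted list of (docid, offset) positions."""
    index = {}
    for docid, text in enumerate(docs, start=1):
        for offset, term in enumerate(text.lower().split(), start=1):
            index.setdefault(term, []).append((docid, offset))
    return index

def next_pos(index, term, pos):
    """First occurrence of term strictly after pos, or INF."""
    postings = index.get(term, [])
    i = bisect_right(postings, pos)
    return postings[i] if i < len(postings) else INF

def prev_pos(index, term, pos):
    """Last occurrence of term strictly before pos, or NEG_INF."""
    postings = index.get(term, [])
    i = bisect_left(postings, pos)
    return postings[i - 1] if i > 0 else NEG_INF

def next_cover(index, terms, pos):
    """Next cover [u, v] after pos that lies inside a single document."""
    v = max(next_pos(index, t, pos) for t in terms)
    if v == INF:
        return INF, INF
    just_after_v = (v[0], v[1] + 1)
    u = min(prev_pos(index, t, just_after_v) for t in terms)
    if u[0] == v[0]:
        return u, v
    return next_cover(index, terms, u)  # cover straddles documents: skip it

def rank_proximity(index, terms, k):
    """Return up to k (docid, score) pairs, highest proximity score first."""
    u, v = next_cover(index, terms, NEG_INF)
    d = u[0]
    score = 0.0
    results = []
    while u < INF:
        if d < u[0]:
            # docid changed: record the finished document
            results.append((d, score))
            d = u[0]
            score = 0.0
        score += 1.0 / (v[1] - u[1] + 1)
        u, v = next_cover(index, terms, u)
    if d < math.inf:
        results.append((d, score))  # record the last document
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]

index = build_index(["a b x a b", "b x x a"])
print(rank_proximity(index, ["a", "b"], 2))
```

On this made-up two-document corpus, document 1 has covers of lengths 2, 3 and 2 (score 1/2 + 1/3 + 1/2) and document 2 has a single cover of length 4 (score 1/4), so document 1 is ranked first.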

In-Class Exercise

Suppose we have two documents `d_1` = "I want to be able to go" and `d_2` = "to be or not to be".
Work out step by step how the call rankProximity("to", "be", 2) would process this corpus.
Please post your solution to the Feb 24 In-Class Exercise Thread.

Boolean Retrieval

Boolean queries are used as implicit filters in some web search engines, and explicit support for them is important in specific applications such as digital libraries and the legal domain.
For example, when we do a patent search, we want an exhaustive list of the patents that match the query, not just the top 10.
In the Boolean retrieval model we thus return sets of results rather than ranked lists.
The standard Boolean operators AND, OR, and NOT are used to construct queries.
Here "A AND B" means the intersection of the sets A and B; "A OR B" means the union of the sets A and B; and "NOT A" means those documents in the collection not contained in A.
In a query, a term `t` stands for the set of documents containing `t`.
So ("quarrel" OR "sir") AND "you" represents the set of documents that contain "you" and also contain at least one of the two terms "quarrel" and "sir".
An algorithm to solve a Boolean query locates candidate solutions to the query, where each candidate solution represents a range of documents that together satisfy the query.
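The set semantics of AND, OR, and NOT described above can be illustrated with Python sets. The docids and the docs_with table below are invented for illustration only.

```python
# Hypothetical collection of five documents, identified by docid.
collection = {1, 2, 3, 4, 5}

# Hypothetical sets of docids containing each term.
docs_with = {
    "quarrel": {1, 2, 5},
    "sir": {1, 2, 3, 5},
    "you": {1, 4, 5},
}

# ("quarrel" OR "sir") AND "you": union, then intersection.
result = (docs_with["quarrel"] | docs_with["sir"]) & docs_with["you"]

# NOT "you": complement relative to the whole collection.
not_you = collection - docs_with["you"]

print(sorted(result))   # [1, 5]
print(sorted(not_you))  # [2, 3]
```

Note that the result is a set, with no ranking among the documents in it.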

Extending Our ADT for Boolean Retrieval

To simplify our Boolean search algorithm, we define two functions that operate over Boolean queries:

docRight(Q, u)
end point of the first candidate solution to `Q` starting after document `u`

docLeft(Q, v)
start point of the last candidate solution to `Q` ending before document `v`
For terms we define docRight(t, u) := nextDoc(t, u) and docLeft(t,v) := prevDoc(t,v).

For AND and OR operators we define:

docRight(A AND B, u) := max(docRight(A, u), docRight(B,u))
docLeft(A AND B, v) := min(docLeft(A,v), docLeft(B,v))
docRight(A OR B, u) := min(docRight(A, u), docRight(B,u))
docLeft(A OR B, v) := max(docLeft(A,v), docLeft(B,v))

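Notice that, given a Boolean query `Q`, the above rules give us a recursive algorithm to compute docRight and docLeft: apply the rule for the outermost operator of `Q`, and recurse on its operands until terms are reached.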
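A minimal sketch of this recursion in Python might look as follows. The query encoding (a term is a string, an operator node is a tuple such as ("AND", A, B)), the POSTINGS table, and the use of binary search for next_doc and prev_doc are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
import math

# Hypothetical sorted docid postings lists.
POSTINGS = {
    "quarrel": [1, 2, 5],
    "sir": [1, 2, 3, 5],
    "you": [1, 4, 5],
}

def next_doc(term, u):
    """Smallest docid > u containing term, or infinity."""
    docs = POSTINGS.get(term, [])
    i = bisect_right(docs, u)
    return docs[i] if i < len(docs) else math.inf

def prev_doc(term, v):
    """Largest docid < v containing term, or -infinity."""
    docs = POSTINGS.get(term, [])
    i = bisect_left(docs, v)
    return docs[i - 1] if i > 0 else -math.inf

def doc_right(q, u):
    """End point of the first candidate solution to q starting after u."""
    if isinstance(q, str):
        return next_doc(q, u)
    op, a, b = q
    pick = max if op == "AND" else min
    return pick(doc_right(a, u), doc_right(b, u))

def doc_left(q, v):
    """Start point of the last candidate solution to q ending before v."""
    if isinstance(q, str):
        return prev_doc(q, v)
    op, a, b = q
    pick = min if op == "AND" else max
    return pick(doc_left(a, v), doc_left(b, v))

Q = ("AND", ("OR", "quarrel", "sir"), "you")
print(doc_right(Q, 1))  # 4
print(doc_left(Q, 5))   # 3
```

With these made-up postings, the candidate solution after document 1 is the range [3, 4]: documents 3 and 4 together satisfy the query ("sir" occurs in 3, "you" in 4), but no single document in the range does.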

Algorithm to Return the Next Solution to a Positive Boolean Query (No NOTs)

nextSolution(Q, position)
{
    v := docRight(Q, position);
    if(v == infty) then
        return infty;
    u := docLeft(Q, v+1);
    if(u == v) then
        return u;
    else
        return nextSolution(Q, v);
}

Algorithm to Return All Solutions to a Positive Boolean Query

u := -infty;
while(u < infty) do
    u := nextSolution(Q, u);
    if(u < infty) then
        report docid(u);

If we implement nextDoc and prevDoc with galloping search, the time complexity of this algorithm is `O(n cdot l cdot log(L/l))`.
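Putting nextSolution and the reporting loop together, a self-contained Python sketch could look like the following. As before, the tuple encoding of queries, the index dictionary, and the helper names are assumptions for illustration; next_doc and prev_doc use binary search (bisect) instead of galloping search, so the bound above is not claimed for this code.

```python
from bisect import bisect_left, bisect_right
import math

def next_doc(index, t, u):
    """Smallest docid > u in the sorted postings list of t, or infinity."""
    docs = index.get(t, [])
    i = bisect_right(docs, u)
    return docs[i] if i < len(docs) else math.inf

def prev_doc(index, t, v):
    """Largest docid < v in the sorted postings list of t, or -infinity."""
    docs = index.get(t, [])
    i = bisect_left(docs, v)
    return docs[i - 1] if i > 0 else -math.inf

def doc_right(index, q, u):
    if isinstance(q, str):
        return next_doc(index, q, u)
    op, a, b = q
    pick = max if op == "AND" else min
    return pick(doc_right(index, a, u), doc_right(index, b, u))

def doc_left(index, q, v):
    if isinstance(q, str):
        return prev_doc(index, q, v)
    op, a, b = q
    pick = min if op == "AND" else max
    return pick(doc_left(index, a, v), doc_left(index, b, v))

def next_solution(index, q, position):
    """Next docid after position that satisfies q on its own, or infinity."""
    v = doc_right(index, q, position)
    if v == math.inf:
        return math.inf
    u = doc_left(index, q, v + 1)
    if u == v:
        return u  # the candidate range is a single document
    return next_solution(index, q, v)

def all_solutions(index, q):
    """Collect every solution to a positive Boolean query q."""
    solutions = []
    u = next_solution(index, q, -math.inf)
    while u < math.inf:
        solutions.append(u)
        u = next_solution(index, q, u)
    return solutions

# Hypothetical docid index.
index = {"quarrel": [1, 2, 5], "sir": [1, 2, 3, 5], "you": [1, 4, 5]}
Q = ("AND", ("OR", "quarrel", "sir"), "you")
print(all_solutions(index, Q))  # [1, 5]
```

In this sketch positions are already docids, so the docid(u) step of the pseudocode is the identity.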

Handling Queries with NOT

To handle negations, we first use De Morgan's rules to push negations inward until they apply only to terms,
i.e., NOT (A AND B) = (NOT A) OR (NOT B) and NOT (A OR B) = (NOT A) AND (NOT B).
Then we enhance our methods nextDoc and prevDoc so that they can directly handle negated terms, i.e., so that nextDoc(NOT t, u) and prevDoc(NOT t, v) make sense.
Finally, we set docRight(NOT t, u) := nextDoc(NOT t, u) and docLeft(NOT t,v) := prevDoc(NOT t,v).
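The steps above might be sketched in Python as follows. This is only one possible reading: the tuple encoding of queries, the function names push_not, next_doc_not and prev_doc_not, the assumption that docids run from 1 to num_docs, and the simple linear scan over docids are all choices made for this illustration, not the lecture's implementation.

```python
from bisect import bisect_left
import math

def push_not(q, negate=False):
    """Rewrite q so that NOT applies only to terms (De Morgan's rules)."""
    if isinstance(q, str):
        return ("NOT", q) if negate else q
    if q[0] == "NOT":
        return push_not(q[1], not negate)
    op, a, b = q
    if negate:
        op = "OR" if op == "AND" else "AND"
    return (op, push_not(a, negate), push_not(b, negate))

def contains(docs, d):
    """True if docid d occurs in the sorted postings list docs."""
    i = bisect_left(docs, d)
    return i < len(docs) and docs[i] == d

def next_doc_not(docs, u, num_docs):
    """Smallest docid > u in 1..num_docs that is absent from docs."""
    d = 1 if u < 1 else u + 1
    while d <= num_docs and contains(docs, d):
        d += 1
    return d if d <= num_docs else math.inf

def prev_doc_not(docs, v, num_docs):
    """Largest docid < v in 1..num_docs that is absent from docs."""
    d = num_docs if v > num_docs else v - 1
    while d >= 1 and contains(docs, d):
        d -= 1
    return d if d >= 1 else -math.inf

q = ("NOT", ("AND", "a", ("OR", "b", ("NOT", "c"))))
print(push_not(q))  # ('OR', ('NOT', 'a'), ('AND', ('NOT', 'b'), 'c'))

postings_t = [1, 2, 5]  # hypothetical postings list for a term t
print(next_doc_not(postings_t, -math.inf, 5))  # 3
print(prev_doc_not(postings_t, math.inf, 5))   # 4
```

Double negations cancel in push_not, and a negated term ends up as a ("NOT", t) leaf that docRight and docLeft can hand to the negated versions of nextDoc and prevDoc.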