More VSM, Proximity Ranking, Boolean Retrieval




CS267

Chris Pollett

Sep 17, 2019

Outline

Introduction

TF-IDF

Algorithm to Compute Cosine Rank

rankCosine(t[1],.., t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1;
    d := min_(1 <= i <= n) nextDoc(t[i], - infty);
    // we only need to consider docs containing at least one term
    while d < infty do
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec t);
        j++;
        d := min_(1 <= i <= n) nextDoc(t[i], d);
    sort Result by score, highest first;
    return Result[1 .. k];
}
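The pseudocode above can be tried out on a toy index. The following Python version is only a minimal sketch under stated assumptions: the `INDEX` dictionary, the document count, and the helper names are hypothetical, and the TF-IDF weight `(log2(tf) + 1) * log2(N / N_t)` is one common choice rather than necessarily the one used in this lecture.

```python
import math
from bisect import bisect_right

INF = math.inf

# Hypothetical toy frequency index: term -> {docid: term frequency}.
INDEX = {
    "quarrel": {1: 1, 2: 1},
    "sir": {1: 2, 2: 1, 3: 1},
    "thumb": {3: 2, 4: 1},
}
NUM_DOCS = 4  # assumed collection size


def next_doc(term, current):
    """Smallest docid greater than current that contains term, else INF."""
    docids = sorted(INDEX.get(term, {}))
    i = bisect_right(docids, current)
    return docids[i] if i < len(docids) else INF


def weight(term, tf):
    """Assumed TF-IDF weight: (log2(tf) + 1) * log2(N / N_t)."""
    return (math.log2(tf) + 1) * math.log2(NUM_DOCS / len(INDEX[term]))


def doc_vector(docid):
    """Sparse TF-IDF vector of a document over the indexed terms."""
    return {t: weight(t, p[docid]) for t, p in INDEX.items() if docid in p}


def sim(d_vec, q_vec):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    d_len = math.sqrt(sum(w * w for w in d_vec.values()))
    q_len = math.sqrt(sum(w * w for w in q_vec.values()))
    return dot / (d_len * q_len) if d_len and q_len else 0.0


def rank_cosine(terms, k):
    """Score every document containing at least one query term."""
    q_vec = {t: weight(t, terms.count(t)) for t in set(terms) if t in INDEX}
    results = []
    d = min(next_doc(t, -INF) for t in terms)
    while d < INF:
        results.append((d, sim(doc_vector(d), q_vec)))
        d = min(next_doc(t, d) for t in terms)
    results.sort(key=lambda r: r[1], reverse=True)  # highest score first
    return results[:k]
```

On this toy data, `rank_cosine(["quarrel", "sir"], 3)` visits documents 1, 2, and 3 and never touches document 4, which contains neither query term.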

Proximity Ranking

Algorithm for Finding Covers

nextCover(t[1],.., t[n], position) 
{
    v:= max_(1≤ i ≤ n)(next(t[i], position));
    if(v == infty)
        return [ infty, infty];
    u := min_(1≤ i ≤ n)(prev(t[i], v+1))
    if(docid(u) == docid(v) ) then
        // covers need to be in the same document
        return [u,v];
    else
        return nextCover(t[1],.., t[n], u);
}
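To see nextCover at work, here is a minimal Python sketch. It assumes, purely for illustration, that all documents are laid out back to back in one global position space, so a position is a single integer and `v + 1` is meaningful; the `POSTINGS` and `DOC_STARTS` data and the helper names are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right

INF = math.inf

# Hypothetical toy collection: positions 1..5 belong to document 1,
# positions 6..10 to document 2.
DOC_STARTS = [1, 6]
POSTINGS = {
    "you": [1, 4, 9],
    "sir": [3, 7],
}


def next_pos(term, current):
    """First occurrence of term strictly after current, else INF."""
    plist = POSTINGS.get(term, [])
    i = bisect_right(plist, current)
    return plist[i] if i < len(plist) else INF


def prev_pos(term, current):
    """Last occurrence of term strictly before current, else -INF."""
    plist = POSTINGS.get(term, [])
    i = bisect_left(plist, current)
    return plist[i - 1] if i > 0 else -INF


def docid(position):
    """Document containing a global position (1-based docids)."""
    return bisect_right(DOC_STARTS, position)


def next_cover(terms, position):
    """Next cover [u, v] that starts after position."""
    v = max(next_pos(t, position) for t in terms)
    if v == INF:
        return (INF, INF)
    u = min(prev_pos(t, v + 1) for t in terms)
    if docid(u) == docid(v):
        return (u, v)
    # The candidate straddles a document boundary; keep searching from u.
    return next_cover(terms, u)
```

Calling `next_cover(["you", "sir"], 3)` first finds the candidate interval [4, 7], rejects it because it spans both documents, and then returns (7, 9).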

Ranking Covers

Ranking Algorithm with Proximity Scores

rankProximity(t[1],.., t[n], k)
// t[] term vector
// k number of results to return 
{
    u := - infty;
    [u,v] := nextCover(t[1],.., t[n], u);
    d := docid(u);
    score := 0;
    j := 0;
    while( u < infty) do
        if(d < docid(u) ) then
        // if docid changes record info about last docid
            j := j + 1;
            Result[j].docid := d;
            Result[j].score := score;
            d := docid(u);
            score := 0;
        score := score + 1/(v - u +1);
        [u, v] := nextCover(t[1],.., t[n], u);
    if(d < infty) then
        // record last score if not recorded
        j := j + 1;
        Result[j].docid := d;
        Result[j].score := score;
    sort Result[1..j] by score;
    return Result[1..k];   
}
  

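Using an analysis similar to the one the book gives for galloping search, one can show that rankProximity has running time `O(n^2 cdot l cdot log(L/l))`, where `n` is the number of query terms and `l` and `L` are the lengths of the shortest and longest postings lists among those terms.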
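The proximity ranking loop can be sketched in Python as follows. This is an illustrative version, not a reference implementation: it assumes a single global position space shared by all documents, uses plain binary search (the standard `bisect` module) rather than galloping search, writes nextCover as a loop instead of a recursion, and relies on hypothetical toy data.

```python
import math
from bisect import bisect_left, bisect_right

INF = math.inf

# Hypothetical toy collection: positions 1..5 are document 1,
# positions 6..10 are document 2.
DOC_STARTS = [1, 6]
POSTINGS = {"you": [1, 4, 9], "sir": [3, 7]}


def next_pos(term, current):
    plist = POSTINGS.get(term, [])
    i = bisect_right(plist, current)
    return plist[i] if i < len(plist) else INF


def prev_pos(term, current):
    plist = POSTINGS.get(term, [])
    i = bisect_left(plist, current)
    return plist[i - 1] if i > 0 else -INF


def docid(position):
    """Document of a position; docid(INF) is INF, as the pseudocode assumes."""
    return INF if position == INF else bisect_right(DOC_STARTS, position)


def next_cover(terms, position):
    """Loop form of nextCover: next same-document cover after position."""
    while True:
        v = max(next_pos(t, position) for t in terms)
        if v == INF:
            return INF, INF
        u = min(prev_pos(t, v + 1) for t in terms)
        if docid(u) == docid(v):
            return u, v
        position = u


def rank_proximity(terms, k):
    """Sum 1 / (cover length) per document and return the top k."""
    results = []
    u, v = next_cover(terms, -INF)
    d = docid(u)
    score = 0.0
    while u < INF:
        if d < docid(u):
            # The docid changed, so record the finished document.
            results.append((d, score))
            d = docid(u)
            score = 0.0
        score += 1 / (v - u + 1)
        u, v = next_cover(terms, u)
    if d < INF:
        results.append((d, score))  # record the last document
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]
```

With this data, document 1 has covers [1, 3] and [3, 4] and scores 1/3 + 1/2, while document 2 has the single cover [7, 9] and scores 1/3, so document 1 ranks first.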

Quiz

Which of the following is true?

  1. The posting list next() operation should always be implemented using a binary search between first() and last().
  2. Using a frequency index we can determine the number of times a term occurred in a document.
  3. The vector space model uses sine similarity to determine how close a query is to a document.

Boolean Retrieval

Extending Our ADT for Boolean Retrieval

Algorithm to Return the Next Solution to a Positive Boolean Query (No NOTs)

nextSolution(Q, position)
{
    v := docRight(Q, position);
    if(v == infty) then
        return infty;
    u := docLeft(Q, v+1);
    if(u == v) then
        return u;
    else
        return nextSolution(Q, v);
}

Algorithm to Return All Solutions to a Positive Boolean Query

u :=  -infty
while u < infty do
    u := nextSolution(Q, u);
    if(u < infty) then
        report docid(u);

If we implement nextDoc and prevDoc with galloping search, the complexity of this algorithm is `O(n cdot l cdot log(L/l))`.
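The two Boolean retrieval routines above can be sketched together in Python. The definitions of docRight and docLeft are not shown in this text, so the sketch assumes the usual ones: for a term they are nextDoc and prevDoc, for AND they take the max and min of the operands, and for OR the min and max. The query encoding (nested tuples), the `DOCS` data, and the use of plain binary search instead of galloping search are likewise assumptions made for illustration.

```python
import math
from bisect import bisect_left, bisect_right

INF = math.inf

# Hypothetical docid index: term -> sorted list of docids containing it.
DOCS = {
    "quarrel": [1, 2],
    "sir": [1, 2, 3, 5],
    "you": [2, 4, 5],
}


def next_doc(term, current):
    """Smallest docid greater than current containing term, else INF."""
    docids = DOCS.get(term, [])
    i = bisect_right(docids, current)
    return docids[i] if i < len(docids) else INF


def prev_doc(term, current):
    """Largest docid less than current containing term, else -INF."""
    docids = DOCS.get(term, [])
    i = bisect_left(docids, current)
    return docids[i - 1] if i > 0 else -INF


# A query is either a term (str) or a tuple (op, left, right),
# where op is "AND" or "OR".
def doc_right(q, u):
    """End of the first candidate solution after u (assumed definition)."""
    if isinstance(q, str):
        return next_doc(q, u)
    op, a, b = q
    pick = max if op == "AND" else min
    return pick(doc_right(a, u), doc_right(b, u))


def doc_left(q, v):
    """Start of the last candidate solution before v (assumed definition)."""
    if isinstance(q, str):
        return prev_doc(q, v)
    op, a, b = q
    pick = min if op == "AND" else max
    return pick(doc_left(a, v), doc_left(b, v))


def next_solution(q, position):
    v = doc_right(q, position)
    if v == INF:
        return INF
    u = doc_left(q, v + 1)
    if u == v:
        return u  # the candidate interval is a single document
    return next_solution(q, v)


def all_solutions(q):
    """Collect every docid that satisfies the positive Boolean query."""
    found = []
    u = -INF
    while u < INF:
        u = next_solution(q, u)
        if u < INF:
            found.append(u)
    return found
```

Here positions are docids themselves, so the sketch collects `u` directly where the pseudocode reports `docid(u)`. For the query `("AND", ("OR", "quarrel", "you"), "sir")`, document 4 is reached as a candidate but rejected because it lacks "sir", and the search resumes from there.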

Handling Queries with NOT