VSM, Proximity Ranking




CS267

Chris Pollett

Feb 22, 2021

Outline

Introduction

Document Statistics

When working with documents, there are several common statistics that people typically keep track of:

`N_t` (document frequency)
the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency)
the number of occurrences of the term `t` in the document `d`.
`l_d` (document length)
the length of document `d` measured in tokens.
`l_(avg)` (average length)
the average document length across the collection.
`N` (document count)
the total number of documents in the collection.
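The statistics above can be computed in one pass over a tokenized collection. The following is a minimal Python sketch, not a definitive implementation: the function name `collection_stats`, the dictionary keys, and the three-document toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def collection_stats(docs):
    """Compute basic statistics for docs, a dict mapping docid -> list of tokens.

    The name and return layout are hypothetical, chosen for this sketch.
    """
    N = len(docs)                                       # document count
    l_d = {d: len(tokens) for d, tokens in docs.items()}  # document lengths
    l_avg = sum(l_d.values()) / N if N else 0.0         # average length
    f_td = {d: Counter(tokens) for d, tokens in docs.items()}  # term frequencies
    N_t = Counter()                                     # document frequencies
    for counts in f_td.values():
        N_t.update(counts.keys())  # each document counts once per distinct term
    return {"N": N, "l_d": l_d, "l_avg": l_avg, "f_td": f_td, "N_t": N_t}


# Toy corpus (made up for this example).
docs = {
    1: "do you quarrel sir".split(),
    2: "quarrel sir no sir".split(),
    3: "no better".split(),
}
stats = collection_stats(docs)
print(stats["N"], stats["N_t"]["sir"], stats["f_td"][2]["sir"], stats["l_d"][3])
```

Here "sir" occurs in two documents, so its document frequency is 2, while its term frequency in document 2 is also 2 because it appears twice there.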

Also, when working with document-oriented indexes, it is common for the ADT to support coarser-grained methods such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `current`), and prevDoc(`t`, `current`). A method like nextDoc returns the first document containing the term `t` that comes after `current` in the corpus; that is, these methods ignore where in the document the term occurs.
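One way to picture these document-level methods is as lookups in a sorted list of docids per term. The sketch below assumes that representation and uses binary search from Python's `bisect` module. The class name `DocIndex`, the snake_case method names, and the sample documents are hypothetical and not part of the course's ADT.

```python
import math
from bisect import bisect_left, bisect_right

INF = math.inf


class DocIndex:
    """Toy document-oriented index: term -> sorted list of docids (an assumption)."""

    def __init__(self, docs):
        # docs: dict mapping integer docid -> list of tokens
        self.postings = {}
        for docid in sorted(docs):
            for term in set(docs[docid]):
                self.postings.setdefault(term, []).append(docid)

    def first_doc(self, t):
        p = self.postings.get(t, [])
        return p[0] if p else INF

    def last_doc(self, t):
        p = self.postings.get(t, [])
        return p[-1] if p else -INF

    def next_doc(self, t, current):
        # First docid strictly greater than current, or INF if none.
        p = self.postings.get(t, [])
        i = bisect_right(p, current)
        return p[i] if i < len(p) else INF

    def prev_doc(self, t, current):
        # Last docid strictly less than current, or -INF if none.
        p = self.postings.get(t, [])
        i = bisect_left(p, current)
        return p[i - 1] if i > 0 else -INF


idx = DocIndex({1: ["a", "b"], 2: ["b"], 5: ["a", "a"]})
print(idx.next_doc("a", 1), idx.prev_doc("a", 5), idx.next_doc("a", 5))
```

Note that document 5 contains "a" twice but appears only once in the posting list, which reflects that these methods do not care about positions inside a document.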

Posting Lists and Index Types

Ranking and Retrieval

The Vector Space Model

Cosine Similarity

TF-IDF

Algorithm to Compute Cosine Rank

rankCosine(t[1],...,t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1;
    d := min_(1 <= i <= n) nextDoc(t[i], -infty);
    // we only need to consider docs containing at least one term
    while d < infty do
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec t);
        j++;
        d := min_(1 <= i <= n) nextDoc(t[i], d);
    sort Result by score in decreasing order;
    return Result[1 .. k];
}
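The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `rank_cosine`, the in-memory dictionary of token lists, and the toy corpus are hypothetical. The pseudocode leaves `sim` abstract, so the sketch assumes one common TF-IDF weighting, `(log2(f_(t,d)) + 1) * log2(N / N_t)`, with vectors normalized to unit length before taking the dot product.

```python
import math
from bisect import bisect_right
from collections import Counter


def rank_cosine(docs, query, k):
    """Return the top-k (docid, score) pairs for query (a list of terms).

    docs maps integer docid -> list of tokens. All names are hypothetical.
    """
    N = len(docs)
    tf = {d: Counter(tokens) for d, tokens in docs.items()}
    df = Counter(t for counts in tf.values() for t in counts)
    terms = set(query)
    postings = {t: sorted(d for d in docs if t in tf[d]) for t in terms}

    def next_doc(t, current):
        p = postings[t]
        i = bisect_right(p, current)
        return p[i] if i < len(p) else math.inf

    def weight(f, t):
        # Assumed TF-IDF weighting; the pseudocode does not fix a formula.
        if f == 0 or df[t] == 0:
            return 0.0
        return (math.log2(f) + 1.0) * math.log2(N / df[t])

    def unit(vec):
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

    q_vec = unit({t: weight(f, t) for t, f in Counter(query).items()})
    results = []
    # Visit only documents containing at least one query term.
    d = min((next_doc(t, -math.inf) for t in terms), default=math.inf)
    while d < math.inf:
        d_vec = unit({t: weight(f, t) for t, f in tf[d].items()})
        score = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        results.append((d, score))
        d = min(next_doc(t, d) for t in terms)
    results.sort(key=lambda r: r[1], reverse=True)  # highest score first
    return results[:k]


# Toy corpus (made up for this example).
docs = {
    1: ["apple", "banana"],
    2: ["apple", "apple", "cherry"],
    3: ["banana", "cherry", "cherry"],
    4: ["durian"],
}
ranked = rank_cosine(docs, ["apple"], 2)
print(ranked)
```

Documents 3 and 4 never become candidates because they contain no query term. Document 2 ranks above document 1 because "apple" carries more of its vector's weight.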

Quiz

Which of the following is true?

  1. The PHP syntax $x = file("my.dat"); can be used to read the file my.dat into a single string $x.
  2. The binary search implementation of next() is always faster than the sequential scan implementation of next().
  3. Galloping/Exponential search makes use of binarySearch.

Proximity Ranking

Algorithm for Finding Covers

nextCover(t[1],..., t[n], position)
{
    v := max_(1 ≤ i ≤ n)(next(t[i], position));
    if(v == infty)
        return [infty, infty];
    u := min_(1 ≤ i ≤ n)(prev(t[i], v+1));
    if(docid(u) == docid(v)) then
        // covers need to be in the same document
        return [u, v];
    else
        return nextCover(t[1],..., t[n], u);
}
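A runnable version of nextCover needs concrete `next`, `prev`, and `docid` operations. The Python sketch below assumes a toy positional index in which tokens are numbered globally from 1 across the whole collection and document boundaries are tracked by each document's first position. The class `PositionalIndex` and the sample documents are hypothetical, and the recursion in the pseudocode is written as an equivalent loop.

```python
import math
from bisect import bisect_left, bisect_right

INF = math.inf


class PositionalIndex:
    """Toy index: term -> sorted global positions (an assumed layout)."""

    def __init__(self, docs):
        # docs: list of token lists; positions run 1, 2, 3, ... across all docs
        self.postings = {}
        self.doc_starts = []  # first global position of each document
        pos = 0
        for tokens in docs:
            self.doc_starts.append(pos + 1)
            for tok in tokens:
                pos += 1
                self.postings.setdefault(tok, []).append(pos)

    def next(self, t, position):
        p = self.postings.get(t, [])
        i = bisect_right(p, position)
        return p[i] if i < len(p) else INF

    def prev(self, t, position):
        p = self.postings.get(t, [])
        i = bisect_left(p, position)
        return p[i - 1] if i > 0 else -INF

    def docid(self, position):
        # 1-based number of the document containing this position
        return bisect_right(self.doc_starts, position)


def next_cover(index, terms, position):
    """First cover [u, v] starting after position, or (INF, INF) if none."""
    while True:
        v = max(index.next(t, position) for t in terms)
        if v == INF:
            return (INF, INF)
        u = min(index.prev(t, v + 1) for t in terms)
        if index.docid(u) == index.docid(v):
            return (u, v)  # covers need to be in the same document
        position = u  # candidate spans a boundary, so retry after u


# Positions: a=1 b=2 c=3 | b=4 a=5 x=6 a=7 b=8
index = PositionalIndex([["a", "b", "c"], ["b", "a", "x", "a", "b"]])
print(next_cover(index, ["a", "b"], 0))
print(next_cover(index, ["a", "b"], 1))
```

Calling `next_cover` repeatedly with the previous `u` as the new position enumerates all covers in order, which is how the results would typically be fed to a proximity scoring step.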

Ranking Covers