VSM, Proximity Ranking




CS267

Chris Pollett

Feb 21, 2022

Outline

Introduction

Document Statistics

When working with documents there are several common statistics which people typically keep track of:

`N_t` (document frequency)
the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency)
the number of occurrences of the term `t` in the document `d`.
`l_d` (document length)
the length of document `d` measured in tokens.
`l_(avg)` (average length)
the average document length across the collection.
`N` (document count)
the total number of documents in the collection.

Also, when working with document-oriented indexes it is common to support coarser grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `mbox(current)`), and prevDoc(`t`, `mbox(current)`). The idea of a method like nextDoc, is that it returns the first document with the term `t` after `current` in the corpus. i.e., we don't care about position in the document with this method.

Posting Lists and Index Types

Ranking and Retrieval

The Vector Space Model

Cosine Similarity

Quiz

Which of the following is true?

  1. The binary search implementation of next() is always faster than the sequential scan implementation of next().
  2. The PHP syntax $x = file("my.dat"); can be used to read the file my.dat into a single string $x.
  3. Galloping/Exponential search makes use of binary search.

TF-IDF

Algorithm to Compute Cosine Rank

rankCosine(t[1],...t[n], k) 
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1
    d := min_(1 <= i <= n) nextDoc(t[i], - infty)
    //we only need to consider docs containing at least one term
    while d < infty do
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec t)
        j++;
        d := min_(1 <= i <= n, nextDoc(t[i], d))
    sort Result by score;
    return Result[1 .. k];
} 

Proximity Ranking

Algorithm for Finding Covers

nextCover(t[1],.., t[n], position) 
{
    v:= max_(1≤ i ≤ n)(next(t[i], position));
    if(v == infty)
        return [ infty, infty];
    u := min_(1≤ i ≤ n)(prev(t[i], v+1))
    if(docid(u) == docid(v) ) then 
        return [u,v]; 
       // covers need to be in the same document
    else
        return nextCover(t[1],.., t[n], u);
}

Ranking Covers