Document-Oriented Indexes; the Vector Space Model




CS267

Chris Pollett

Sep. 14, 2011

Outline

Document-oriented Indexes

Document Statistics

When working with documents, several statistics are commonly tracked:

`N_t` (document frequency): the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency): the number of occurrences of the term `t` in the document `d`.
`l_d` (document length): the length of document `d` measured in tokens.
`l_(avg)` (average length): the average document length across the collection.
`N` (document count): the total number of documents in the collection.
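
As a rough illustration of the statistics above, here is a minimal Python sketch that computes them for a tiny in-memory collection. The function name `document_statistics`, the dictionary keys, and the three sample documents are made up for this example; a real index would store these values alongside its posting lists rather than recompute them.

```python
from collections import Counter

def document_statistics(docs):
    """docs maps a document id to its list of tokens (hypothetical layout)."""
    N = len(docs)                                        # document count
    l_d = {d: len(tokens) for d, tokens in docs.items()} # document lengths
    l_avg = sum(l_d.values()) / N if N else 0.0          # average length
    f_td = {d: Counter(tokens) for d, tokens in docs.items()}  # term frequencies
    N_t = Counter()                                      # document frequencies
    for counts in f_td.values():
        N_t.update(counts.keys())                        # count each doc once per term
    return {"N": N, "l_d": l_d, "l_avg": l_avg, "f_td": f_td, "N_t": N_t}

# Sample collection, invented for this sketch.
docs = {
    1: "do you quarrel sir".split(),
    2: "quarrel sir no sir".split(),
    3: "no better".split(),
}
stats = document_statistics(docs)
```

Here `stats["N_t"]["sir"]` is the document frequency of "sir" and `stats["f_td"][2]["sir"]` is its term frequency in document 2.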

Also, when working with document-oriented indexes, it is common for the ADT to support coarser-grained methods such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `current`), and prevDoc(`t`, `current`). A method like nextDoc returns the first document after `current` in the corpus that contains the term `t`; that is, these methods ignore where in the document the term occurs.
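
One possible way to realize these four methods is sketched below in Python. It assumes each term has a sorted list of document ids and uses binary search from the standard `bisect` module. The class name `DocIndex`, the snake_case method names, and the use of plus or minus infinity as the "no such document" value are assumptions of this sketch, not part of the ADT as stated above.

```python
import bisect
import math

INF = math.inf

class DocIndex:
    """Docid-only index: term -> sorted list of ids of documents containing it."""

    def __init__(self, docs):
        self.postings = {}
        for doc_id in sorted(docs):
            for term in set(docs[doc_id]):       # positions are ignored
                self.postings.setdefault(term, []).append(doc_id)

    def next_doc(self, t, current):
        """First document after `current` containing t, or INF if none."""
        plist = self.postings.get(t, [])
        i = bisect.bisect_right(plist, current)
        return plist[i] if i < len(plist) else INF

    def prev_doc(self, t, current):
        """Last document before `current` containing t, or -INF if none."""
        plist = self.postings.get(t, [])
        i = bisect.bisect_left(plist, current)
        return plist[i - 1] if i > 0 else -INF

    def first_doc(self, t):
        return self.next_doc(t, -INF)

    def last_doc(self, t):
        return self.prev_doc(t, INF)

# Sample collection, invented for this sketch.
docs = {
    1: "do you quarrel sir".split(),
    2: "quarrel sir no sir".split(),
    3: "no better".split(),
}
index = DocIndex(docs)
```

With this collection, `index.next_doc("sir", 1)` moves from document 1 to document 2, even though "sir" occurs twice in document 2.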

Posting Lists and Index Types

Ranking and Retrieval

The Vector Space Model

Cosine Similarity

TF-IDF

Algorithm to Compute Cosine Rank

rankCosine(t[1], ..., t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1;
    d := min_(1 <= i <= n) nextDoc(t[i], -infty);
    // we only need to consider docs containing at least one query term
    while d < infty do {
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec q);
        j++;
        d := min_(1 <= i <= n) nextDoc(t[i], d);
    }
    sort Result by score in decreasing order;
    return Result[1 .. min(k, j - 1)];
}
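
The rankCosine pseudocode above can be tried out with the following Python sketch. It is only an illustration under stated assumptions: the term weight `(log2(f_(t,d)) + 1) * log2(N / N_t)` is one common TF-IDF variant chosen here for concreteness, `next_doc` is a simple linear scan, and the sample documents and all helper names are made up for this example.

```python
import math
from collections import Counter

INF = math.inf

def tf_idf_vector(counts, N_t, N):
    # Assumed weighting: (log2(f) + 1) * log2(N / N_t); other variants exist.
    return {t: (math.log2(f) + 1) * math.log2(N / N_t[t])
            for t, f in counts.items() if f > 0 and t in N_t}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_cosine(docs, query_terms, k):
    N = len(docs)
    f_td = {d: Counter(tokens) for d, tokens in docs.items()}
    N_t = Counter()
    for counts in f_td.values():
        N_t.update(counts.keys())
    postings = {}
    for d in sorted(docs):
        for t in f_td[d]:
            postings.setdefault(t, []).append(d)

    def next_doc(t, current):
        # Linear scan for brevity; a real index would search faster.
        for d in postings.get(t, []):
            if d > current:
                return d
        return INF

    q = tf_idf_vector(Counter(t for t in query_terms if t in N_t), N_t, N)
    results = []
    # Visit only documents containing at least one query term.
    d = min((next_doc(t, -INF) for t in query_terms), default=INF)
    while d < INF:
        results.append((d, cosine(tf_idf_vector(f_td[d], N_t, N), q)))
        d = min(next_doc(t, d) for t in query_terms)
    results.sort(key=lambda r: r[1], reverse=True)   # best score first
    return results[:k]

# Sample collection, invented for this sketch.
docs = {
    1: "do you quarrel sir".split(),
    2: "quarrel sir no sir".split(),
    3: "no better".split(),
}
top = rank_cosine(docs, ["quarrel", "sir"], 2)
```

For the query "quarrel sir", document 3 is never scored because it contains neither term, and document 2 ranks ahead of document 1 because "sir" occurs twice in it and it has fewer unrelated terms.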