Index Types, VSM Ranking




CS267

Chris Pollett

Sep 18, 2023

Outline

Introduction

Documents and Other Elements

Document-oriented Indexes

Measuring Retrieval Effectiveness

Recall and Precision

Document Statistics

When working with documents, there are several common statistics that are typically tracked:

`N_t` (document frequency)
the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency)
the number of occurrences of the term `t` in the document `d`.
`l_d` (document length)
the length of document `d` measured in tokens.
`l_(avg)` (average length)
the average document length across the collection.
`N` (document count)
the total number of documents in the collection.
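As a rough illustration of the statistics above, the following sketch computes them for a tiny made-up collection. The collection `docs`, the whitespace tokenizer, and the helper name `collection_stats` are all assumptions for this example and are not part of the lecture.

```python
from collections import Counter

# Hypothetical toy collection: docid -> text (not from the lecture).
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog jumps over the lazy dog",
}

def tokenize(text):
    # Assumption: tokens are lowercase, whitespace-separated words.
    return text.lower().split()

def collection_stats(docs):
    # f_(t,d): term frequency of each term in each document.
    tf = {d: Counter(tokenize(text)) for d, text in docs.items()}
    # l_d: document length in tokens.
    length = {d: sum(counts.values()) for d, counts in tf.items()}
    # N: total number of documents.
    n = len(docs)
    # l_avg: average document length.
    l_avg = sum(length.values()) / n
    # N_t: number of documents containing each term.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, length, n, l_avg, df

tf, length, n, l_avg, df = collection_stats(docs)
```

Here `df["dog"]` counts documents 2 and 3 once each, while `tf[3]["dog"]` counts both occurrences inside document 3, which is the distinction between `N_t` and `f_(t,d)`.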

Also, when working with document-oriented indexes, it is common to support coarser-grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `current`), and prevDoc(`t`, `current`). A method like nextDoc returns the first document after `current` in the corpus that contains the term `t`; that is, it ignores where in the document the term occurs.
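One possible way to realize these four methods is sketched below, assuming each term maps to a sorted list of docids and that "no such document" is signalled by plus or minus infinity (matching the `infty` sentinels used in the ranking pseudocode). The class name `DocIndex` and the use of binary search via `bisect` are choices made for this example only.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

class DocIndex:
    """Toy document-oriented index: term -> sorted list of docids."""

    def __init__(self, postings):
        # Deduplicate and sort so that binary search is valid.
        self.postings = {t: sorted(set(ds)) for t, ds in postings.items()}

    def first_doc(self, t):
        return self.next_doc(t, -INF)

    def last_doc(self, t):
        return self.prev_doc(t, INF)

    def next_doc(self, t, current):
        # Smallest docid strictly greater than current, or INF if none.
        ds = self.postings.get(t, [])
        i = bisect_right(ds, current)
        return ds[i] if i < len(ds) else INF

    def prev_doc(self, t, current):
        # Largest docid strictly less than current, or -INF if none.
        ds = self.postings.get(t, [])
        i = bisect_left(ds, current)
        return ds[i - 1] if i > 0 else -INF

index = DocIndex({"dog": [2, 3, 7], "fox": [1]})
```

With this toy index, `index.next_doc("dog", 2)` skips to document 3 regardless of where "dog" appears inside either document.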

Posting Lists and Index Types

Ranking and Retrieval

The Vector Space Model

Cosine Similarity

Quiz

Which of the following is true?

  1. Galloping/Exponential search makes use of binary search.
  2. The binary search implementation of prev() is always faster than the sequential scan implementation of prev().
  3. The PHP syntax $x = file("my.dat"); can be used to read the file my.dat into a single string $x.

TF-IDF

Algorithm to Compute Cosine Rank

rankCosine(t[1], ..., t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1
    d := min_(1 <= i <= n) nextDoc(t[i], -infty)
    // we only need to consider docs containing at least one term
    while d < infty do
        Result[j].docid := d
        Result[j].score := sim(vec d, vec t)
        j := j + 1
        d := min_(1 <= i <= n) nextDoc(t[i], d)
    sort Result by score, in decreasing order
    return Result[1 .. k]
}
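The pseudocode above can be sketched in Python as follows. This is only an illustration: the toy collection `DOCS`, the linear-scan `next_doc`, and in particular the weighting inside `weight` (a TF-IDF variant, `(log2(f) + 1) * log2(N / N_t)`) are assumptions made here, since the section does not fix how `sim` computes its vectors.

```python
import math
from collections import Counter

INF = float("inf")

# Hypothetical toy collection (docid -> tokens); not from the lecture.
DOCS = {
    1: "the quick brown fox".split(),
    2: "the lazy dog".split(),
    3: "the quick dog jumps over the lazy dog".split(),
}
TF = {d: Counter(toks) for d, toks in DOCS.items()}        # f_(t,d)
N = len(DOCS)                                              # N
DF = Counter(t for counts in TF.values() for t in counts)  # N_t

def next_doc(t, current):
    """Smallest docid > current containing t, or INF (linear scan for brevity)."""
    return min((d for d in DOCS if d > current and TF[d][t] > 0), default=INF)

def weight(f, t):
    """Assumed TF-IDF weight: (log2(f) + 1) * log2(N / N_t), or 0 if absent."""
    if f == 0 or DF[t] == 0:
        return 0.0
    return (math.log2(f) + 1) * math.log2(N / DF[t])

def sim(d, terms):
    """Cosine of the angle between the weighted vectors of doc d and the query."""
    q = {t: weight(f, t) for t, f in Counter(terms).items()}
    v = {t: weight(f, t) for t, f in TF[d].items()}
    dot = sum(w * v.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_q * norm_v) if norm_q > 0 and norm_v > 0 else 0.0

def rank_cosine(terms, k):
    """Score every doc containing at least one query term; return the top k."""
    if not terms:
        return []
    results = []
    d = min(next_doc(t, -INF) for t in terms)
    while d < INF:
        results.append((d, sim(d, terms)))
        d = min(next_doc(t, d) for t in terms)
    results.sort(key=lambda r: r[1], reverse=True)  # highest score first
    return results[:k]

top = rank_cosine(["quick", "fox"], 2)
```

For the query `["quick", "fox"]`, document 2 is never scored because it contains neither term, and document 1 ranks ahead of document 3 because it matches both terms. A real index would replace the linear scan in `next_doc` with a posting-list lookup.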