Document-Oriented Indexes; the Vector Space Model




CS267

Chris Pollett

Sep. 12, 2012

Outline

Document-oriented Indexes

Document Statistics

When working with documents, there are several statistics that people commonly keep track of:

`N_t` (document frequency)
the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency)
the number of occurrences of the term `t` in the document `d`.
`l_d` (document length)
the length of document `d` measured in tokens.
`l_(avg)` (average length)
the average document length across the collection.
`N` (document count)
the total number of documents in the collection.
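As a minimal sketch (not from the lecture), the five statistics above can be computed from a small in-memory collection of tokenized documents; the toy `docs` collection here is a made-up example:

```python
# Computing N, l_d, l_avg, f_(t,d), and N_t from a toy tokenized collection.
from collections import Counter

docs = [  # hypothetical collection: each document is a list of tokens
    ["to", "be", "or", "not", "to", "be"],
    ["do", "be", "do"],
    ["to", "do", "is", "to", "be"],
]

N = len(docs)                                   # document count
l_d = [len(d) for d in docs]                    # document lengths in tokens
l_avg = sum(l_d) / N                            # average document length
f_td = [Counter(d) for d in docs]               # term frequency per document
N_t = Counter(t for d in docs for t in set(d))  # document frequency per term
```

With this collection, `N` is 3, `l_avg` is 14/3, `f_td[0]["be"]` is 2, and `N_t["be"]` is 3, since "be" occurs twice in the first document and appears in all three.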

Also, when working with document-oriented indexes it is common to support coarser-grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `current`), and prevDoc(`t`, `current`). The idea of a method like nextDoc is that it returns the first document containing the term `t` after `current` in the corpus; i.e., with this method we do not care about the term's position within the document.
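A minimal sketch of these four ADT methods, assuming each term's document-level posting list is a sorted Python list of docids (the `postings` table is a made-up example):

```python
import bisect

INF = float("inf")

postings = {  # hypothetical document-level posting lists (sorted docids)
    "witch": [1, 3, 4],
    "hot": [2, 3, 5],
}

def nextDoc(t, current):
    """Smallest docid containing t strictly greater than current."""
    plist = postings.get(t, [])
    i = bisect.bisect_right(plist, current)
    return plist[i] if i < len(plist) else INF

def prevDoc(t, current):
    """Largest docid containing t strictly less than current."""
    plist = postings.get(t, [])
    i = bisect.bisect_left(plist, current)
    return plist[i - 1] if i > 0 else -INF

def firstDoc(t):
    return nextDoc(t, -INF)

def lastDoc(t):
    return prevDoc(t, INF)
```

For example, `nextDoc("witch", 1)` is 3 and `nextDoc("witch", 4)` is infinity, since no document after 4 contains "witch". Binary search (`bisect`) keeps each call logarithmic in the posting-list length.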

Posting Lists and Index Types

Ranking and Retrieval

Homework

Problem 2.3. Using the methods of the inverted index ADT, write an algorithm that locates all intervals corresponding to speeches (<SPEECH>...</SPEECH>). Assume the schema-independent indexing shown in Figure 2.1, as illustrated by Figure 1.3.

Solution. We assume speech tags cannot be nested. We can take the nextPhrase(t[1], t[2], ..., t[n], position) algorithm from class and modify it. If we called nextPhrase(<SPEECH>, </SPEECH>, position) directly, it would find the next occurrence of the term <SPEECH> immediately followed by </SPEECH>. So it gets the order of the tags right, but it does not allow any terms other than the SPEECH tags to occur between them. To allow this, we drop the v - u == n - 1 check of the original algorithm. Hence, we get the following:

nextSpeech(position)
{
   t[1] := <SPEECH>
   t[2] := </SPEECH>
   n := 2
   v := position
   for i := 1 to n do
      v := next(t[i], v)
   if v == infty then // infty represents a position after the end of the posting list
      return [infty, infty]
   u := v
   for i := n-1 downto 1 do
      u := prev(t[i], u)
   return [u, v]
}

Since speeches cannot be nested, one could get away without the prev call by remembering the position returned by the first next call. Given nextSpeech(position), we can output all occurrences with the algorithm:

position := -infty
while position < infty do
{
    [u, v] := nextSpeech(position)
    if u == infty then
        break // no further speeches; avoid reporting [infty, infty]
    report [u, v] // output this occurrence
    position := u
}
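The two algorithms above can be sketched in Python, assuming a schema-independent index in which each tag is indexed as an ordinary token at its position in the token stream; the `postings` table and the helpers `next_` and `prev_` are made-up stand-ins for the ADT's next and prev methods:

```python
import bisect

INF = float("inf")

# Hypothetical schema-independent posting lists: token -> sorted positions.
postings = {
    "<SPEECH>": [10, 40, 90],
    "</SPEECH>": [25, 60, 120],
}

def next_(t, pos):
    """First position of t strictly after pos (the ADT's next method)."""
    p = postings.get(t, [])
    i = bisect.bisect_right(p, pos)
    return p[i] if i < len(p) else INF

def prev_(t, pos):
    """Last position of t strictly before pos (the ADT's prev method)."""
    p = postings.get(t, [])
    i = bisect.bisect_left(p, pos)
    return p[i - 1] if i > 0 else -INF

def nextSpeech(position):
    """Next [u, v] interval with u at <SPEECH> and v at </SPEECH>."""
    v = next_("<SPEECH>", position)
    v = next_("</SPEECH>", v)
    if v == INF:
        return (INF, INF)
    u = prev_("<SPEECH>", v)
    return (u, v)

def allSpeeches():
    """Report every speech interval, as in the driver loop above."""
    result, position = [], -INF
    while True:
        u, v = nextSpeech(position)
        if u == INF:
            break
        result.append((u, v))
        position = u
    return result
```

On these toy posting lists, `allSpeeches()` yields the intervals (10, 25), (40, 60), and (90, 120).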

The Vector Space Model

Cosine Similarity

TF-IDF

Algorithm to Compute Cosine Rank

rankCosine(t[1], ..., t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1
    d := min_(1 <= i <= n) nextDoc(t[i], -infty)
    // we only need to consider docs containing at least one query term
    while d < infty do
        Result[j].docid := d
        Result[j].score := sim(vec d, vec q)
        j := j + 1
        d := min_(1 <= i <= n) nextDoc(t[i], d)
    sort Result in decreasing order of score
    return Result[1 .. k]
}
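A runnable sketch of this algorithm, assuming document-level posting lists as before, one common TF-IDF weighting (log tf plus one, times log of N over document frequency), and cosine similarity on sparse vectors; the tiny `docs` collection is a made-up example:

```python
import bisect
import math
from collections import Counter

INF = float("inf")

docs = {  # hypothetical tiny collection: docid -> tokens
    1: ["quarrel", "sir", "no", "sir"],
    2: ["sir", "you", "quarrel"],
    3: ["no", "quarrel", "here"],
}
N = len(docs)

postings = {}  # term -> sorted list of docids containing it
for d in sorted(docs):
    for t in set(docs[d]):
        postings.setdefault(t, []).append(d)

def nextDoc(t, current):
    p = postings.get(t, [])
    i = bisect.bisect_right(p, current)
    return p[i] if i < len(p) else INF

def tfidf_vec(tokens):
    """Sparse TF-IDF vector: (log tf + 1) * log(N / N_t)."""
    tf = Counter(tokens)
    return {t: (math.log(tf[t]) + 1) * math.log(N / len(postings[t]))
            for t in tf if t in postings}

def sim(v1, v2):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0

def rankCosine(terms, k):
    q = tfidf_vec(terms)
    results = []
    # only docs containing at least one query term are visited
    d = min(nextDoc(t, -INF) for t in terms)
    while d < INF:
        results.append((sim(tfidf_vec(docs[d]), q), d))
        d = min(nextDoc(t, d) for t in terms)
    results.sort(reverse=True)  # decreasing order of score
    return [docid for _, docid in results[:k]]
```

The min over the nextDoc calls walks the union of the query terms' posting lists in docid order, so each candidate document is scored exactly once.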