CS267
Chris Pollett
Sep. 14, 2011
When working with documents, there are several common statistics that people typically keep track of:
Also, when working with document-oriented indexes, it is common to support coarser-grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `current`), and prevDoc(`t`, `current`). The idea of a method like nextDoc is that it returns the first document containing the term `t` after `current` in the corpus; i.e., we don't care about the position within the document with this method. When no such document exists, nextDoc returns `infty`, which the rankCosine loop uses as its stopping condition.
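To make the document-level methods concrete, here is a minimal Python sketch. It assumes each term maps to a sorted list of document ids; the names `DOC_INDEX`, `first_doc`, `last_doc`, `next_doc`, and `prev_doc`, and the toy data, are illustrative and not part of the course's ADT. Binary search stands in for whatever lookup strategy a real index would use.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Hypothetical toy index: term -> sorted list of ids of documents containing it.
DOC_INDEX = {
    "quarrel": [1, 2],
    "sir": [1, 2, 3, 5],
}


def first_doc(t):
    """Id of the first document containing t, or INF if there is none."""
    docs = DOC_INDEX.get(t, [])
    return docs[0] if docs else INF


def last_doc(t):
    """Id of the last document containing t, or -INF if there is none."""
    docs = DOC_INDEX.get(t, [])
    return docs[-1] if docs else -INF


def next_doc(t, current):
    """Smallest document id containing t that is strictly greater than current."""
    docs = DOC_INDEX.get(t, [])
    i = bisect_right(docs, current)  # index of first id > current
    return docs[i] if i < len(docs) else INF


def prev_doc(t, current):
    """Largest document id containing t that is strictly less than current."""
    docs = DOC_INDEX.get(t, [])
    i = bisect_left(docs, current)  # number of ids < current
    return docs[i - 1] if i > 0 else -INF
```

Calling `next_doc(t, -INF)` behaves like `first_doc(t)`, which is how a scan over all documents containing a term can be started.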
rankCosine(t[1], ..., t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
// vec d and vec q are the document and query vectors
{
    j := 1;
    d := min over 1 <= i <= n of nextDoc(t[i], -infty);
    // we only need to consider docs containing at least one term
    while d < infty do {
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec q);
        j++;
        d := min over 1 <= i <= n of nextDoc(t[i], d);
    }
    sort Result by score, highest first;
    return Result[1 .. min(k, j - 1)];
}
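The rankCosine pseudocode can be sketched in runnable Python as follows. This is only an illustration under stated assumptions: the toy corpus `DOCS`, the `POSTINGS` map, and the TF-IDF weighting in `weight` (here `(log2(tf) + 1) * log2(N / N_t)`) are choices made for the example, since the weighting behind `sim` is not specified in this section.

```python
import math
from bisect import bisect_right

INF = float("inf")

# Hypothetical toy corpus: docid -> {term: term frequency}.
DOCS = {
    1: {"quarrel": 1, "sir": 2},
    2: {"quarrel": 1, "sir": 1, "you": 1},
    3: {"sir": 1, "you": 1},
    4: {"you": 2},
}

# Document-level postings: term -> sorted list of docids containing it.
POSTINGS = {}
for docid in sorted(DOCS):
    for term in DOCS[docid]:
        POSTINGS.setdefault(term, []).append(docid)


def next_doc(t, current):
    """First docid containing t after current, or INF if none."""
    docs = POSTINGS.get(t, [])
    i = bisect_right(docs, current)
    return docs[i] if i < len(docs) else INF


def weight(tf, term):
    """Assumed TF-IDF weight; terms absent from the corpus get weight 0."""
    if tf == 0 or term not in POSTINGS:
        return 0.0
    return (math.log2(tf) + 1) * math.log2(len(DOCS) / len(POSTINGS[term]))


def sim(doc_tf, query_tf):
    """Cosine similarity between the weighted document and query vectors."""
    dvec = {t: weight(f, t) for t, f in doc_tf.items()}
    qvec = {t: weight(f, t) for t, f in query_tf.items()}
    dot = sum(w * qvec.get(t, 0.0) for t, w in dvec.items())
    norm = (math.sqrt(sum(w * w for w in dvec.values()))
            * math.sqrt(sum(w * w for w in qvec.values())))
    return dot / norm if norm > 0 else 0.0


def rank_cosine(terms, k):
    """Return up to k (docid, score) pairs, highest score first."""
    query_tf = {}
    for t in terms:
        query_tf[t] = query_tf.get(t, 0) + 1
    results = []
    # Only documents containing at least one query term are visited.
    d = min((next_doc(t, -INF) for t in terms), default=INF)
    while d < INF:
        results.append((d, sim(DOCS[d], query_tf)))
        d = min(next_doc(t, d) for t in terms)
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]
```

For example, `rank_cosine(["quarrel", "sir"], 2)` visits documents 1, 2, and 3 but never document 4, because document 4 contains neither query term. The slice `results[:k]` also covers the case where fewer than `k` documents match.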