CS267
Chris Pollett
Sep. 14, 2011
When working with documents, there are several common statistics that people typically keep track of:
Also, when working with document-oriented indexes, it is common to support coarser-grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `current`), and prevDoc(`t`, `current`). The idea of a method like nextDoc is that it returns the first document after `current` in the corpus that contains the term `t`. That is, this method does not care about positions within a document.
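As a rough illustration of these document-level methods, here is a minimal Python sketch. It assumes each term maps to a sorted list of docids and uses binary search from the standard `bisect` module. The class name `DocIndex`, the snake_case method names, and the use of `math.inf` for `infty` are assumptions made for this example, not part of the ADT as stated above.

```python
import bisect
import math


class DocIndex:
    """Toy document-oriented index: term -> sorted list of docids.

    Hypothetical layout chosen for illustration only.
    """

    def __init__(self, docs):
        # docs: dict mapping an integer docid to a list of terms.
        self.postings = {}
        for docid in sorted(docs):
            for term in set(docs[docid]):
                # Docids are visited in increasing order, so each list stays sorted.
                self.postings.setdefault(term, []).append(docid)

    def first_doc(self, t):
        p = self.postings.get(t, [])
        return p[0] if p else math.inf

    def last_doc(self, t):
        p = self.postings.get(t, [])
        return p[-1] if p else -math.inf

    def next_doc(self, t, current):
        # First docid containing t that is strictly greater than current.
        p = self.postings.get(t, [])
        i = bisect.bisect_right(p, current)
        return p[i] if i < len(p) else math.inf

    def prev_doc(self, t, current):
        # Last docid containing t that is strictly less than current.
        p = self.postings.get(t, [])
        i = bisect.bisect_left(p, current)
        return p[i - 1] if i > 0 else -math.inf
```

Calling `next_doc(t, -math.inf)` behaves like `first_doc(t)`, and positions inside a document never appear, which is the point of the coarser-grained interface.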
rankCosine(t[1],...t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1
    d := min_(1 <= i <= n) nextDoc(t[i], -infty)
    // we only need to consider docs containing at least one term
    while d < infty do
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec q);
        j++;
        d := min_(1 <= i <= n) nextDoc(t[i], d)
    sort Result by score;
    return Result[1 .. k];
}
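The loop above can be sketched in runnable Python as follows. This is only an illustration under stated assumptions: documents and the query are represented as raw term-frequency vectors, `sim` is taken to be plain cosine similarity over those vectors, and the helper names `next_doc`, `cosine`, and `rank_cosine` are hypothetical. The weighting scheme actually intended for `vec d` and `vec q` is not specified in this section.

```python
import math
from collections import Counter


def next_doc(postings, t, current):
    """First docid containing t that is greater than current, else infinity."""
    for docid in postings.get(t, []):  # linear scan, kept simple for clarity
        if docid > current:
            return docid
    return math.inf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_cosine(docs, query_terms, k):
    """Return up to k (docid, score) pairs, highest score first.

    docs: dict mapping an integer docid to a list of terms.
    Assumption: vectors hold raw term counts (no IDF weighting).
    """
    postings = {}
    for docid in sorted(docs):
        for term in set(docs[docid]):
            postings.setdefault(term, []).append(docid)

    q = Counter(query_terms)
    results = []
    # Only documents containing at least one query term are ever visited.
    d = min((next_doc(postings, t, -math.inf) for t in query_terms),
            default=math.inf)
    while d < math.inf:
        results.append((d, cosine(Counter(docs[d]), q)))
        d = min(next_doc(postings, t, d) for t in query_terms)

    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]
```

For example, with `docs = {1: ["cat", "dog"], 2: ["fish"], 3: ["cat", "cat"]}` and the query `["cat"]`, document 2 is never scored because it contains no query term, and document 3 ranks ahead of document 1.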