CS267
Chris Pollett
Sep. 12, 2012
When working with documents there are several common statistics which people typically keep track of:
Also, when working with document-oriented indexes, it is common to support coarser-grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `current`), and prevDoc(`t`, `current`). The idea of a method like nextDoc is that it returns the first document after `current` in the corpus that contains the term `t`; that is, this method ignores where in the document the term occurs.
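As a rough illustration of these document-level methods, here is a minimal Python sketch. It assumes a toy in-memory index that maps each term to a sorted list of docids; the class name `DocIndex`, the snake_case method names, and the use of plus or minus infinity as sentinels are choices made for this example, not part of the ADT as defined in class.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")  # stands for "past the end" (+INF) or "before the start" (-INF)


class DocIndex:
    """Toy document-oriented index: term -> sorted list of docids (hypothetical)."""

    def __init__(self, docs):
        # docs: dict mapping an integer docid to its list of tokens
        self.postings = {}
        for docid in sorted(docs):
            for term in set(docs[docid]):
                self.postings.setdefault(term, []).append(docid)

    def first_doc(self, t):
        p = self.postings.get(t, [])
        return p[0] if p else INF

    def last_doc(self, t):
        p = self.postings.get(t, [])
        return p[-1] if p else -INF

    def next_doc(self, t, current):
        # first docid strictly greater than current that contains t
        p = self.postings.get(t, [])
        i = bisect_right(p, current)
        return p[i] if i < len(p) else INF

    def prev_doc(self, t, current):
        # last docid strictly less than current that contains t
        p = self.postings.get(t, [])
        i = bisect_left(p, current)
        return p[i - 1] if i > 0 else -INF


idx = DocIndex({1: ["a", "b"], 2: ["b"], 3: ["a"]})
print(idx.next_doc("a", 1))  # docid of the next document after 1 containing "a"
```

Binary search over the sorted docid list is one simple way to answer these calls; other lookup strategies over the posting list would expose the same interface.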
Problem 2.3. Using the methods of the inverted index ADT, write an algorithm that locates all intervals corresponding to speeches (<SPEECH>...</SPEECH>). Assume the schema-independent indexing shown in Figure 2.1, as illustrated by Figure 1.3.
Solution. We are assuming speech tags can't be nested. We can take the algorithm we had from class for nextPhrase(t[1],t[2], .., t[n], position) and try to modify it. If we just called this algorithm as nextPhrase(<SPEECH>,</SPEECH>, position), it would give us the next interval in which the term <SPEECH> is immediately followed by </SPEECH>. So it gets the sequence of tags right, but doesn't allow terms other than the SPEECH tags to appear between them. To allow for this, we modify the original algorithm by dropping its v-u == n - 1 check. Hence, we get the following:
nextSpeech(position)
{
    t[1] = <SPEECH>
    t[2] = </SPEECH>
    n = 2
    v := position
    for i = 1 to n do
        v := next(t[i], v)
    if v == infty then // infty represents after the end of the posting list
        return [infty, infty]
    u := v
    for i := n-1 downto 1 do
        u := prev(t[i], u)
    return [u, v]
}
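One could probably get away without a prev call in this case: since the tags can't be nested, no other <SPEECH> lies between the <SPEECH> found by the first next call and its closing </SPEECH>, so prev just returns that same position. Given nextSpeech(position), we can output all occurrences with the algorithm: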
position = -infty
while (position < infty) {
    [u, v] = nextSpeech(position)
    if u < infty then // skip the [infty, infty] returned once no speeches remain
        report [u, v] // output this occurrence (however we indicate that we located it)
    position = u
}
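The two routines above can be sketched in Python as follows. This is only an illustration under stated assumptions: the schema-independent index is modelled as a dict from each token to its sorted list of positions (numbered from 1), and `build_index`, `next_pos`, `prev_pos`, `next_speech`, and `all_speeches` are names invented for this example, standing in for the ADT's next and prev.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")


def build_index(tokens):
    """Map each token to the sorted list of positions (from 1) where it occurs."""
    index = {}
    for pos, tok in enumerate(tokens, start=1):
        index.setdefault(tok, []).append(pos)
    return index


def next_pos(index, t, current):
    # first position of t strictly after current, or INF if there is none
    p = index.get(t, [])
    i = bisect_right(p, current)
    return p[i] if i < len(p) else INF


def prev_pos(index, t, current):
    # last position of t strictly before current, or -INF if there is none
    p = index.get(t, [])
    i = bisect_left(p, current)
    return p[i - 1] if i > 0 else -INF


def next_speech(index, position):
    v = next_pos(index, "<SPEECH>", position)
    v = next_pos(index, "</SPEECH>", v)
    if v == INF:
        return (INF, INF)
    u = prev_pos(index, "<SPEECH>", v)
    return (u, v)


def all_speeches(index):
    intervals = []
    position = -INF
    while True:
        u, v = next_speech(index, position)
        if u == INF:  # no further speeches
            break
        intervals.append((u, v))
        position = u
    return intervals


tokens = ["<SPEECH>", "to", "be", "</SPEECH>", "x",
          "<SPEECH>", "or", "not", "</SPEECH>"]
print(all_speeches(build_index(tokens)))  # [(1, 4), (6, 9)]
```

Restarting each search from `u` rather than `v` mirrors the pseudocode; because the tags are assumed not to nest, the next `<SPEECH>` after `u` always lies beyond `v`.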
rankCosine(t[1],...t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1
    d := min_(1 <= i <= n) nextDoc(t[i], -infty)
    // we only need to consider docs containing at least one term
    while d < infty do
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec q);
        j++;
        d := min_(1 <= i <= n) nextDoc(t[i], d)
    sort Result by score;
    return Result[1 .. k];
}
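A hedged Python sketch of the same loop is given below. The pseudocode leaves sim(vec d, vec q) unspecified, so this example assumes TF-IDF weights of the form (1 + log2 f) * log2(N / N_t) and unit-length vectors; that weighting, the function name `rank_cosine`, and the representation of the corpus as a dict from integer docid to token list are all assumptions of this sketch.

```python
import math
from bisect import bisect_right
from collections import Counter

INF = float("inf")


def rank_cosine(docs, query_terms, k):
    """Return the top-k (docid, score) pairs; docs maps integer docid -> tokens."""
    n_docs = len(docs)
    tf = {d: Counter(toks) for d, toks in docs.items()}
    postings = {}
    for d in sorted(docs):
        for t in tf[d]:
            postings.setdefault(t, []).append(d)

    def next_doc(t, current):
        p = postings.get(t, [])
        i = bisect_right(p, current)
        return p[i] if i < len(p) else INF

    def weight(count, t):
        # assumed TF-IDF weighting, not fixed by the pseudocode
        n_t = len(postings.get(t, []))
        if count == 0 or n_t == 0:
            return 0.0
        return (1 + math.log2(count)) * math.log2(n_docs / n_t)

    def unit(vec):
        length = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / length for t, w in vec.items()} if length > 0 else vec

    q_vec = unit({t: weight(c, t) for t, c in Counter(query_terms).items()})
    terms = list(q_vec)
    results = []
    # only documents containing at least one query term are ever visited
    d = min((next_doc(t, -INF) for t in terms), default=INF)
    while d < INF:
        d_vec = unit({t: weight(c, t) for t, c in tf[d].items()})
        score = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        results.append((d, score))
        d = min(next_doc(t, d) for t in terms)
    results.sort(key=lambda r: r[1], reverse=True)  # highest score first
    return results[:k]


docs = {1: ["quarrel", "sir"], 2: ["sir", "sir"], 3: ["other"]}
print(rank_cosine(docs, ["quarrel", "sir"], 2))
```

In this small corpus, document 3 contains neither query term, so the loop never visits it, and document 1 ranks first because its vector points in the same direction as the query vector.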