More Evaluating Results, Token and Term Processing




CS267

Chris Pollett

Sep. 19, 2012

Outline

F-measures

Measures for the first `k` results

Example Precision Recall Plot from the Book

Average Precision

Precision at k and MAP scores for different ranking schemes

HW Problem

Exercise 2.8. Demonstrate that the term vector `(t_1, ..., t_n)` may have at most `n cdot l` covers (see Section 2.2.2), where `l` is the length of the shortest posting list for the terms in the vector.

Answer. Page 60 of the book gives the example of a test collection in which all the terms occur in the same order, the same number of times, i.e.,
`...t_1 ... t_2 ... t_3 ... t_n ...t_1 ... t_2 ... t_3 ... t_n ... t_1 ...`
Since any sequence of `n` consecutive terms in the above is a cover, and the pattern repeats `l` times, this shows `Omega(n cdot l)` covers might be required. We are tasked with finding a matching upper bound. To do this it suffices to show that a single occurrence of a term whose posting list is shortest amongst all the terms in the term vector appears in at most `n` covers. Since every cover must contain an occurrence of such a term, and it occurs only `l` times in the corpus, this would give the `n cdot l` bound.

Suppose there were `n+1` covers around an occurrence at location `p` of such a term `t`. Denote them `[L_1, R_1], ..., [L_(n+1), R_(n+1)]`, where `L_i` denotes the left end of the `i`th interval and `R_i` denotes its right end. So for each `i`, `p in [L_i, R_i]` and `R_i - L_i ge n - 1`. Two distinct covers cannot share an endpoint, since one would then contain the other. Let `R_j ge p` be the least value among the `R_i`. Then `L_j` must also be the least value among the `L_i`, for if `L_k` were less than `L_j`, then `[L_k, R_k]` would not be a cover as it would properly contain `[L_j, R_j]`. We can perform a similar argument on the next to smallest `R_(j')` to argue `L_(j')` must be the next to smallest `L_i`. Hence, we can sort the intervals into intervals `[L_(i_v), R_(i_v)]` so that `L_(i_1) < L_(i_2) < ... < L_(i_(n+1))` and `R_(i_1) < R_(i_2) < ... < R_(i_(n+1))`.

The term that the cover `[L_(i_1), R_(i_1)]` begins with cannot be the same as the one `[L_(i_2), R_(i_2)]` begins with, or `[L_(i_1), R_(i_1)]` would not be a cover: since `L_(i_1) < L_(i_2) le p le R_(i_1)`, that term would occur a second time inside it, and you could use `[L_(i_1) + 1, R_(i_1)]` instead. The same reasoning applies to any pair of the covers. As there are only `n` terms, by the pigeonhole principle, two of these supposed `n+1` many covers share the same first term, `t'`. Let's call these covers `[L_j, R_j]` and `[L_k, R_k]`. We can assume `L_j < L_k le p le R_j`, so `[L_j, R_j]` is not a cover, as we could increase `L_j` and not lose `t'`. So we have a contradiction from assuming that we could have `n+1` covers.
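The bound can be checked empirically on small inputs. The following is a minimal Python sketch, not code from the book: the function names `find_covers` and `cover_bound` are hypothetical, positions are assumed to be 0-based, and the document is assumed to be given as a list of tokens. It enumerates every cover of a term vector in one left-to-right pass and compares the count against `n cdot l`.

```python
def find_covers(tokens, terms):
    """Return all covers (u, v), inclusive 0-based positions, of `terms`.

    A cover is an interval containing every term such that no smaller
    interval nested inside it also contains every term.
    """
    terms = set(terms)
    last_seen = {}      # term -> most recent position at or before v
    covers = []
    prev_left = None    # tightest left end for the previous right end
    for v, tok in enumerate(tokens):
        if tok in terms:
            last_seen[tok] = v
        if len(last_seen) < len(terms):
            continue    # some term has not occurred yet
        # Largest u such that tokens[u..v] still contains every term.
        left = min(last_seen.values())
        # (left, v) is minimal exactly when dropping position v would
        # lose a term, i.e. when the tightest left end just moved right.
        if prev_left is None or left > prev_left:
            covers.append((left, v))
        prev_left = left
    return covers


def cover_bound(tokens, terms):
    """Return n * l, where l is the length of the shortest posting list."""
    distinct = set(terms)
    shortest = min(tokens.count(t) for t in distinct)
    return len(distinct) * shortest


if __name__ == "__main__":
    doc = "a b c a b c a b c".split()
    query = ["a", "b", "c"]
    print(find_covers(doc, query))
    print(len(find_covers(doc, query)), "<=", cover_bound(doc, query))
```

On the repeating pattern from the lower-bound example, each window of `n` consecutive positions is reported as a cover, and the count stays below `n cdot l`.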

Building a Test Collection

Efficiency Measures

Token and Terms

Punctuation and Capitalization

Stemming