Outline
- Proximity Ranking
- Quiz
- Boolean Retrieval
- Evaluation of Results
Introduction
- Last week, we presented the vector space model (VSM) as one model for selecting which documents to return
for a given term query and for ranking them.
- Today, we will look at two other models: proximity ranking and boolean retrieval.
- We will then look at different techniques for evaluating query results.
Proximity Ranking
- Results and ranking in the VSM depend only on term frequency (TF) and inverse document frequency (IDF).
- The proximity ranking method, in contrast, depends explicitly on how close together the terms of the term vector occur;
it depends only implicitly on TF, and IDF plays no role at all.
- Given a term vector `langle t_1, ..., t_n rangle`, we define a cover for this vector to be an interval
`[u,v]` in the collection that contains a match to all the terms in the vector, such that no smaller interval `[u', v']`
contained in `[u,v]` also contains a match to all the terms.
- For example, if document 1 in our collection began "To be or not to be", and the term vector was "to be", then the intervals `[1:1, 1:2]` and `[1:5, 1:6]`
would be covers, but `[1:1, 1:3]` would not be, as it contains the smaller matching interval `[1:1, 1:2]`.
- Covers are allowed to overlap. If "meow meow" were the term vector and document 2 began "meow meow meow meow meow meow ...", then
the intervals `[2:1, 2:2]` and `[2:2, 2:3]` would both be covers.
- A given token, though, can appear in at most `n` covers, so there are at most `n cdot l` covers in total, where `l` is the length of the shortest posting list.
Algorithm for Finding Covers
nextCover(t[1],.., t[n], position)
{
    v := max_(1 ≤ i ≤ n)(next(t[i], position));
    if (v == infty) then
        return [infty, infty];
    u := min_(1 ≤ i ≤ n)(prev(t[i], v+1));
    if (docid(u) == docid(v)) then
        // covers must lie within a single document
        return [u, v];
    else
        return nextCover(t[1],.., t[n], u);
}
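The pseudocode above can be sketched as runnable Python. The in-memory index, the position encoding as `(docid, offset)` pairs, and the helper names `next_` and `prev_` are all assumptions for illustration; ordinary binary search (via `bisect`) stands in for the galloping search assumed by the stated running-time bounds.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy index: term -> sorted list of (docid, offset) positions.
# Document 1 is "To be or not to be".
INDEX = {
    "to": [(1, 1), (1, 5)],
    "be": [(1, 2), (1, 6)],
}
NEG_INF = (float("-inf"), float("-inf"))
POS_INF = (float("inf"), float("inf"))

def next_(t, current):
    """Smallest position of term t strictly after `current`."""
    plist = INDEX.get(t, [])
    i = bisect_right(plist, current)
    return plist[i] if i < len(plist) else POS_INF

def prev_(t, current):
    """Largest position of term t strictly before `current`."""
    plist = INDEX.get(t, [])
    i = bisect_left(plist, current)
    return plist[i - 1] if i > 0 else NEG_INF

def next_cover(terms, position):
    """Return the first cover for `terms` starting after `position`."""
    v = max(next_(t, position) for t in terms)
    if v == POS_INF:
        return POS_INF, POS_INF
    u = min(prev_(t, (v[0], v[1] + 1)) for t in terms)
    if u[0] == v[0]:                  # covers must lie in one document
        return u, v
    return next_cover(terms, u)

print(next_cover(["to", "be"], NEG_INF))  # ((1, 1), (1, 2))
```

Calling `next_cover` repeatedly, each time from the start point `u` of the cover just found, enumerates all covers in order.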
Ranking Covers
- We now want a formula that scores a document according to the covers it contains for the
term query.
- We would like smaller covers to be worth more than larger covers.
- We would also like a document with more covers to score higher than one with fewer.
- Keeping this in mind, suppose document `d` has covers `[u_1, v_1]`, `[u_2, v_2]`, ...
- Then we define the score for `d` to be:
`mbox(score)(d) = sum_i frac(1)(v_i - u_i + 1)`.
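As a small worked example, the formula can be evaluated directly. The two covers below are illustrative, e.g. `[1:1, 1:2]` and `[1:5, 1:6]` from the "to be" example, written here as plain `(u, v)` offset pairs within one document:

```python
# Two covers of width 2: each contributes 1/(v - u + 1) = 1/2.
covers = [(1, 2), (5, 6)]

score = sum(1.0 / (v - u + 1) for (u, v) in covers)
print(score)  # 1.0
```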
Ranking Algorithm with Proximity Scores
rankProximity(t[1],.., t[n], k)
// t[] term vector
// k number of results to return
{
    u := -infty;
    [u, v] := nextCover(t[1],.., t[n], u);
    d := docid(u);
    score := 0;
    j := 0;
    while (u < infty) do
        if (d < docid(u)) then
            // if docid changes, record info about the last docid
            j := j + 1;
            Result[j].docid := d;
            Result[j].score := score;
            d := docid(u);
            score := 0;
        score := score + 1/(v - u + 1);
        [u, v] := nextCover(t[1],.., t[n], u);
    if (d < infty) then
        // record the last score if not yet recorded
        j := j + 1;
        Result[j].docid := d;
        Result[j].score := score;
    sort Result[1..j] by score;
    return Result[1..k];
}
Using an analysis similar to that used for galloping search in the book, you can prove this algorithm has running time
`O(n^2 l cdot log(L/l))`, where `l` and `L` are the lengths of the shortest and longest posting lists.
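The whole ranking pipeline can be sketched as a self-contained Python program. The toy index, the `(docid, offset)` position encoding, and all function names are assumptions for illustration, and `bisect`-based binary search stands in for galloping search:

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy index: term -> sorted (docid, offset) positions.
# Doc 1 is "To be or not to be"; doc 2 contains "to be" at offsets 3-4.
INDEX = {
    "to": [(1, 1), (1, 5), (2, 3)],
    "be": [(1, 2), (1, 6), (2, 4)],
}
NEG_INF = (float("-inf"), float("-inf"))
POS_INF = (float("inf"), float("inf"))

def next_(t, cur):
    plist = INDEX.get(t, [])
    i = bisect_right(plist, cur)
    return plist[i] if i < len(plist) else POS_INF

def prev_(t, cur):
    plist = INDEX.get(t, [])
    i = bisect_left(plist, cur)
    return plist[i - 1] if i > 0 else NEG_INF

def next_cover(terms, position):
    v = max(next_(t, position) for t in terms)
    if v == POS_INF:
        return POS_INF, POS_INF
    u = min(prev_(t, (v[0], v[1] + 1)) for t in terms)
    if u[0] == v[0]:                      # same document: a genuine cover
        return u, v
    return next_cover(terms, u)

def rank_proximity(terms, k):
    """Score each document by the sum of 1/(cover width); return top k."""
    results = []
    u, v = next_cover(terms, NEG_INF)
    d, score = u[0], 0.0
    while u != POS_INF:
        if d < u[0]:                      # docid changed: flush previous doc
            results.append((d, score))
            d, score = u[0], 0.0
        score += 1.0 / (v[1] - u[1] + 1)
        u, v = next_cover(terms, u)
    if d != float("inf"):
        results.append((d, score))        # record the last document
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]

print(rank_proximity(["to", "be"], 2))    # [(1, 1.25), (2, 0.5)]
```

Document 1 has three covers for "to be" (widths 2, 4, and 2), giving score 0.5 + 0.25 + 0.5 = 1.25, which ranks it above document 2's single cover.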
Quiz
Which of the following is true?
- For any algorithm implementing the inverted index ADT, both the next() and prev() functions require time `Omega(n cdot L^2)`.
- Define IDF to be 0 if a term never appears. Assume that our corpus has at least one document. It is possible for TF-IDF, as defined in class, to be a negative number under these assumptions.
- The galloping search algorithm makes use of binary search.
Boolean Retrieval
- Besides being used as implicit filters in some web search engines, explicit support for Boolean queries is important for some specific applications such as digital libraries and in the legal domain.
- For example, when we do a patent search, we want an exhaustive list of patents returned which match the term query, not just the top 10.
- In the Boolean retrieval model we thus return sets of results rather than ranked lists.
- The standard Boolean operators AND, OR, and NOT are used to construct queries.
- Here "A AND B" means the intersection of the sets A and B; "A OR B" means their union; and "NOT A" means those documents in the collection not contained in A.
- In a query, a term `t` denotes the set of documents containing `t`.
- So ("quarrel" OR "sir") AND "you" represents the set of documents that contain "you" and also contain at least one of the two terms "quarrel" and "sir".
- An algorithm to solve a Boolean query locates candidate solutions to the query, where each candidate solution represents a range of documents that together satisfy the query.
Extending Our ADT for Boolean Retrieval
- To simplify our Boolean search algorithm, we define two functions that operate over Boolean queries:
- docRight(Q, u)
- the end point of the first candidate solution to `Q` starting after document `u`
- docLeft(Q, v)
- the start point of the last candidate solution to `Q` ending before document `v`
- For terms we define docRight(t, u) := nextDoc(t, u) and docLeft(t,v) := prevDoc(t,v).
- For AND and OR operators we define:
docRight(A AND B, u) := max(docRight(A, u), docRight(B,u))
docLeft(A AND B, v) := min(docLeft(A,v), docLeft(B,v))
docRight(A OR B, u) := min(docRight(A, u), docRight(B,u))
docLeft(A OR B, v) := max(docLeft(A,v), docLeft(B,v))
- Notice that the above rules give us a recursive algorithm to compute docRight and docLeft for any Boolean query `Q`.
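The recursive rules can be sketched in Python over a query represented as nested tuples. The doc-level index, the `next_doc`/`prev_doc` helpers, and the tuple encoding of queries are all assumptions for illustration, with binary search in place of galloping search:

```python
from bisect import bisect_left, bisect_right

# Hypothetical doc-level index: term -> sorted list of docids containing it.
DOCS = {
    "quarrel": [1, 2],
    "sir": [2, 4],
    "you": [1, 3, 4],
}
INF = float("inf")

def next_doc(t, u):
    """Smallest docid > u containing term t."""
    ds = DOCS.get(t, [])
    i = bisect_right(ds, u)
    return ds[i] if i < len(ds) else INF

def prev_doc(t, v):
    """Largest docid < v containing term t."""
    ds = DOCS.get(t, [])
    i = bisect_left(ds, v)
    return ds[i - 1] if i > 0 else -INF

def doc_right(q, u):
    """End point of the first candidate solution to q after u."""
    if isinstance(q, str):
        return next_doc(q, u)
    op, a, b = q
    f = max if op == "AND" else min          # AND: max, OR: min
    return f(doc_right(a, u), doc_right(b, u))

def doc_left(q, v):
    """Start point of the last candidate solution to q before v."""
    if isinstance(q, str):
        return prev_doc(q, v)
    op, a, b = q
    f = min if op == "AND" else max          # AND: min, OR: max
    return f(doc_left(a, v), doc_left(b, v))

# ("quarrel" OR "sir") AND "you"
Q = ("AND", ("OR", "quarrel", "sir"), "you")
print(doc_right(Q, -INF))  # 1
```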
Algorithm to Return the Next Solution to a Positive Boolean Query (No NOT's).
nextSolution(Q, position)
{
    v := docRight(Q, position);
    if (v == infty) then
        return infty;
    u := docLeft(Q, v+1);
    if (u == v) then
        return u;
    else
        return nextSolution(Q, v);
}
Algorithm to Return All Solutions to a Positive Boolean Query
u := -infty;
while (u < infty) do
    u := nextSolution(Q, u);
    if (u < infty) then
        report docid(u);
If we implement nextDoc and prevDoc with galloping search, the complexity of this algorithm is
`O(n cdot l cdot log(L/l))`.
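Putting nextSolution and the enumeration loop together gives a self-contained sketch. Again the toy index, helper names, and tuple query encoding are illustrative assumptions, and binary search replaces galloping search:

```python
from bisect import bisect_left, bisect_right

# Hypothetical doc-level index: term -> sorted list of docids containing it.
DOCS = {
    "quarrel": [1, 2],
    "sir": [2, 4],
    "you": [1, 3, 4],
}
INF = float("inf")

def next_doc(t, u):
    ds = DOCS.get(t, [])
    i = bisect_right(ds, u)
    return ds[i] if i < len(ds) else INF

def prev_doc(t, v):
    ds = DOCS.get(t, [])
    i = bisect_left(ds, v)
    return ds[i - 1] if i > 0 else -INF

def doc_right(q, u):
    if isinstance(q, str):
        return next_doc(q, u)
    op, a, b = q
    f = max if op == "AND" else min
    return f(doc_right(a, u), doc_right(b, u))

def doc_left(q, v):
    if isinstance(q, str):
        return prev_doc(q, v)
    op, a, b = q
    f = min if op == "AND" else max
    return f(doc_left(a, v), doc_left(b, v))

def next_solution(q, position):
    """First document after `position` satisfying positive query q."""
    v = doc_right(q, position)
    if v == INF:
        return INF
    u = doc_left(q, v + 1)
    if u == v:              # candidate collapses to a single document
        return u
    return next_solution(q, v)

def all_solutions(q):
    """Report every document satisfying q, in increasing docid order."""
    out, u = [], -INF
    while u < INF:
        u = next_solution(q, u)
        if u < INF:
            out.append(u)
    return out

Q = ("AND", ("OR", "quarrel", "sir"), "you")
print(all_solutions(Q))  # [1, 4]
```

Here documents 1 and 4 are exactly those containing "you" along with at least one of "quarrel" and "sir".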
Handling Queries with NOT
- To handle negations we first use De Morgan's rules to push negations so that they are only on terms.
- i.e., NOT (A AND B ) = NOT A OR NOT B and NOT (A OR B) = NOT A AND NOT B
- Then we enhance our methods nextDoc and prevDoc so that they can directly handle negated terms, i.e., so that nextDoc(NOT t, u) and prevDoc(NOT t, v) make sense.
- Finally, we set docRight(NOT t, u) := nextDoc(NOT t, u) and docLeft(NOT t,v) := prevDoc(NOT t,v).
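One way nextDoc(NOT t, u) might be realized is sketched below. The assumptions are that docids run 1..N and that `DOCS` maps each term to its sorted posting list; the linear scan is for clarity only, whereas a real implementation would instead search the gaps of `t`'s posting list:

```python
# Hypothetical setup: docids are 1..N; DOCS maps term -> sorted docids.
N = 4
DOCS = {"quarrel": [1, 2]}
INF = float("inf")

def next_doc_not(t, u):
    """Smallest docid > u that does NOT contain term t."""
    posted = set(DOCS.get(t, []))
    d = 1 if u == -INF else int(u) + 1
    while d <= N:
        if d not in posted:
            return d
        d += 1
    return INF

print(next_doc_not("quarrel", -INF))  # 3
```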
Measuring Effectiveness
- Measuring the effectiveness of a retrieval method depends on human assessments of relevance.
- TREC experiments often use binary relevance judgments by humans, i.e., someone says document X is or is not relevant
to topic A.
- A document is typically judged relevant if any part of it is relevant.
- For example, for TREC topic 426, a user might formulate the Boolean query:
(("law" AND "enforcement") OR "police") AND ("dog" OR "dogs").
- Using the TREC45 collection (disks 4 and 5 of a TREC dataset of newspaper articles), this query returns 881 of the roughly 500,000 documents in the collection.
Recall and Precision
- To determine the effectiveness of a Boolean query, we compare (1) the set of documents returned by the query, Res, and (2) the set of relevant documents for the topic contained in the collection, Rel.
- From these two sets we can compute two common standard measures of effectiveness, recall and precision.
`mbox(recall) = frac(|Rel cap Res|)(|Rel|)`
`mbox(precision) = frac(|Rel cap Res|)(|Res|)`
- Recall indicates the fraction of relevant documents that appear in the result set; precision indicates the fraction of the result set that is relevant.
- According to NIST, there are `202` relevant documents for topic 426. The above Boolean query returns only `167` of these. So the precision is `167/881 approx 0.190` and the recall is `167/202 approx 0.827`.
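Plugging the counts from the topic 426 example into the two formulas:

```python
# Counts from the topic 426 example: |Rel| = 202 relevant documents,
# |Res| = 881 documents returned, |Rel ∩ Res| = 167.
rel, res, overlap = 202, 881, 167

recall = overlap / rel
precision = overlap / res
print(f"recall={recall:.3f} precision={precision:.3f}")
# recall=0.827 precision=0.190
```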