CS267 Spring 2022 Practice Final

To study for the final I would suggest you:

Know how to do (by heart) all the practice problems.
Go over your notes at least three times. On the second and third passes, try to see how much you can remember from the first.
Go over the homework problems.
Try to create your own problems similar to the ones I have given and solve them.
Skim the relevant sections from the book.
If you want to study in groups, at this point you are ready to quiz each other.

Here are some facts about the actual final:

It is comprehensive.
It is closed book, closed notes. Nothing will be permitted on your desk except your pen (pencil) and test.
You should bring photo ID.
There will be more than one version of the test. Each version will be of comparable difficulty.
It is 10 problems (3pts each): 6 problems will be on material since the second midterm, and 4 problems will be from the topics of the midterm.
Two problems will be taken exactly (less typos) from the practice final, and one will be taken from the practice midterm.

The practice final is below:

Prove there is an upper bound on how much a single term's BM25 score for a single document can contribute to the overall BM25 score for that document with respect to a query.
Give the context in which one might use accumulator pruning for ranking. Then explain and give an example of how the accumulator pruning algorithm from class works.
Express the following as region algebra queries: (a) Your first name within 8 terms of your last name. (b) Your zip code in a
tags before the phrase "doxing is fun".
Give a concrete example involving your name of how a Canonical Huffman code might be written as a preamble to encoding a string.
Briefly describe how each of the following coding schemes for compressing lists works: (a) `gamma`-code, (b) LLRUN, (c) Rice code.
Give a situation with numbers where it would make sense to rebuild an index rather than remerge.
Suppose there is a 1,000,000 word corpus. Document `d` is 450 words long. The terms blue and suede occur 800 and 100 times in the corpus, respectively. In document `d`, blue occurs twice and suede once. Let `mu=1000`. What is the LMD score for `d` for the query "blue suede"?
Distinguish intra-query and inter-query parallelism as far as information retrieval goes. What bottlenecks exist to intra-query parallelism if partition-by-document is used?
Explain how hub and authority scores are calculated in SALSA.
Briefly define aggregate P@k and aggregate AP and explain why they might be used.
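
If you want to check your arithmetic on the LMD problem above, the following is a minimal Python sketch. It assumes the usual language-modeling-with-Dirichlet-smoothing formula, score = sum over query terms of log(1 + (f_td / mu) * (l_C / l_t)) minus n * log(1 + l_d / mu), with base-2 logarithms and each query term counted once. The function name `lmd_score` and its parameters are made up for illustration; if your notes use a different log base, the number changes by a constant factor.

```python
import math

def lmd_score(query_terms, doc_tf, corpus_tf, doc_len, corpus_len, mu):
    """LMD score of one document for a query (assumed formula, log base 2).

    query_terms: list of query terms, each assumed to occur once in the query
    doc_tf:      term -> number of occurrences in the document (f_td)
    corpus_tf:   term -> number of occurrences in the whole corpus (l_t)
    doc_len:     document length in words (l_d)
    corpus_len:  corpus length in words (l_C)
    mu:          Dirichlet smoothing parameter
    """
    score = 0.0
    for t in query_terms:
        f_td = doc_tf.get(t, 0)   # a term absent from the document adds log2(1) = 0
        l_t = corpus_tf[t]
        score += math.log2(1 + (f_td / mu) * (corpus_len / l_t))
    # Length normalization: one log(1 + l_d / mu) penalty per query term.
    score -= len(query_terms) * math.log2(1 + doc_len / mu)
    return score

# Numbers from the "blue suede" practice problem.
score = lmd_score(
    ["blue", "suede"],
    doc_tf={"blue": 2, "suede": 1},
    corpus_tf={"blue": 800, "suede": 100},
    doc_len=450,
    corpus_len=1_000_000,
    mu=1000,
)
print(round(score, 3))
```

With these numbers the per-term arguments work out to 1 + 2.5 = 3.5 for blue and 1 + 10 = 11 for suede, and the length penalty is 2 * log(1.45), so under the stated assumptions the score equals log2(3.5 * 11 / 1.45^2). On the exam you would be expected to show these steps by hand.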