CS267 Fall 2019Practice Final

To study for the final I would suggest you:

Know how to do (by heart) all the practice problems.
Go over your notes at least three times. Second and third time try to see how much you can remember from the first time.
Go over the homework problems.
Try to create your own problems similar to the ones I have given and solve them.
Skim the relevant sections from the book.
If you want to study in groups, at this point you are ready to quiz each other.

Here are some facts about the actual final:

It is comprehensive.
It is closed book, closed notes. Nothing will be permitted on your desk except your pen (pencil) and test.
You should bring photo ID.
There will be more than one version of the test. Each version will be of comparable difficulty.
It is 10 problems (2pts each), 6 problems will be on materials since the second midterm, 4 problems will be from the topics of the midterm.
Two problems will be exactly (less typos) off of the practice final, and one will be off of the practice midterm.

The practice final is below:

Prove there is an upper bound that a single term's BM25 score for a single document can contributed to the overall BM25 score for that document with respect to a query.
Give the context in which one might use accumulator pruning for ranking. Then explain and give an example of how the accumulator pruning algorithm from class works.
What file does the trec_eval software take as input? Briefly give the format of each. What kind of output does it produce?
Suppose 'a' occurs with probability 0.65 and 'b' with probability 0.35. Suppose our symbols are bi-grams of these two letters. Draw the Huffman tree for these bi-grams. What would the preamble for the canonical Huffman code look like?
Briefly explain how the following index compression codes work: (a) `gamma`-code. (b) Golomb code.
Explain what kind of incremental updates logarithmic merging is a suitable algorithhm for. Explain how this algorithm works.
Give the following ranking functions together with common values for any magic numbers they have: (a)LMD, (b) LMJM, (c) DFR. Define each of the variables your formulas have and give an example of using them.
Suppose we are using document partitioning to split our index across 5 machines. When a user enters a query our search engines returns the top 10 results. What is the probability machine 1 has half of the top 10 results?
Briefly explain how the page rank algorithm might be implemented as a map reduce job.
What is Batch filtering? What is aggregate P@k?