CS267 Spring 2021 Practice Midterm

For the midterm I would suggest you:

  • Know how to do (by heart) all the practice problems.
  • Go over your notes at least three times. Second and third time try to see how much you can remember from the first time.
  • Go over the homework problems.
  • Try to create your own problems similar to the ones I have given and solve them.
  • If you want to study in groups, at this point you are ready to quiz each other.

The practice midterm is below. Here are some facts about the actual midterm:

  1. It is open book, open notes, open internet, however, you are not allowed to discuss problems on the midterm with classmates or other people until after the midterm submission period is over.
  2. It will be posted by the start of class time on the midterm day as one of the links on the side of the class homepage.
  3. You submit the midterm for grading using the same mechanism as to submit homeworks.
  4. It is designed to take an 1h15m, however, you can submit it up until midnight on the day it is posted.
  5. Some problems will be dependent on some personal information about you such as your ID, and you will not receive credit if you don't make use of this information correctly.
  6. One problem will be off the practice midterm.

Practice Problems

On practice midterm day, please post your problem solutions to the Practice Midterm Thread.

  1. Define the following IR terms and give an example: (a) the Probability Ranking Principle, (b) specificity, (c) exhaustivity, and (d) novelty.
  2. Pick your favorite (family-friendly) limerick. Give the probability of each term in this limerick according to a first order language model. Compute the MLE of the second line.
  3. Consider the "As I was going to St Ives ..." riddle, explain how the nextPhrase algorithm would work on inputs: nextPhrase("had", "seven", 10). As there are variants to the riddle, tell me which one you are using as you go through the solution.
  4. Suggest a data structure that could be used to implement an inverted index in PHP, and give a working code snippet to show how to implement first($t), last($t) from our inverted index ADT.
  5. Give pseudo-code for an galloping search implementation of prev(t, pos), where prev is from out inverted index ADT.
  6. Define the following terms (1pt each): (a) docid index, (b) frequency index, (c) positional index, and (d) schema-independent index.
  7. In the following two documents fill in the blank with your home town and your favorite holiday respectively. (1) When I was little, I don't recall space aliens visiting ____. (2) Maybe the space aliens were waiting for ____. Assume we are using the vector space model with TF-IDF scores for the components. Compute the TF-IDF vectors for each document. Compute their cosine similarity.
  8. Give the nextSolution algorithm from class for positive boolean queries. Show how it would work on a concrete corpus and query.
  9. Suppose `n=3` what would be the character `n`-grams for the word toast? (1pt) What is stopping? (1pt) What is stemming (1pt)? Give an example of the kinds of rules used by a Porter stemmer.
  10. Briefly explain and give example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.