Chris Pollett> CS267

CS267 Fall 2023 Practice Midterm 1

Studying for one of my tests does involve some memorization. I believe this is an important skill. Often people waste a lot of time and fail to remember the things they are trying to memorize. Please use a technique that has been shown to work, such as the method of loci. Other memorization techniques can be found on the Wikipedia page for Moonwalking with Einstein. Given this, to study for the midterm I would suggest you:

  • Know how to do (by heart) all the practice problems.
  • Go over your notes at least three times. On the second and third passes, try to see how much you can remember from the previous pass.
  • Go over the homework problems.
  • Try to create your own problems similar to the ones I have given and solve them.
  • Skim the relevant sections from the book.
  • If you want to study in groups, at this point you are ready to quiz each other.

The practice midterm is below. Here are some facts about the actual midterm: (a) It is closed book, closed notes. Nothing will be permitted on your desk except your pen (pencil) and test. (b) You should bring photo ID. (c) There will be more than one version of the test. Each version will be of comparable difficulty. (d) One problem (up to typos) on the actual test will be from the practice test.

  1. Define the following information retrieval terms: (a) probability ranking principle, (b) specificity, (c) exhaustivity, (d) novelty.
  2. Suppose one had a corpus of Barack Obama speeches from which one developed a language model `M`. From these speeches it can be determined that he introduces a new word with probability `1/current_corpus_length`. Suppose the current corpus length is 100,000 words. Determine a language model `M'` that would include his next speech (which we know in advance is 1,000 words).
  3. Define and give an example posting for the following index types: (a) docid index, (b) frequency index, (c) positional index, (d) schema-independent index.
  4. Suppose on a query for "Eloise et Abelard" a search engine returns `3000` results, `500` of which are relevant. The indexed corpus contains `4000` documents relevant to this query. Calculate the precision and recall given these numbers.
  5. Suppose we have three topic areas and two relevant documents for each topic area. Assume your search engine eventually returns the relevant results for each topic. Give a concrete example showing how one possible MAP value might be calculated (i.e., you can say at what ranks your search engine returns the relevant results, but otherwise the computation is determined).
  6. Suppose a posting list for a term `t` for a schema-independent index consisted of the numbers `2, 3, 9, 12, 77, 470, 1100, 1400, 2300`. Explain how the galloping search algorithm from class would compute `next(t, 499)`.
  7. Explain how pooling can be used to assign relevance judgements when a corpus of documents is too large.
  8. Give the Proximity Ranking algorithm discussed in class. Explain how ranking can be done using the vector space model and TF-IDF scores. Give a concrete example, being specific about how you are handling TF-IDF scores for queries.
  9. Briefly explain (a) how autoloading in PHP works, (b) how to set up a composer project to use Yioop.
  10. Suppose `n=4`. What would be the character n-grams for the word "caramel"? What is stopping? What is stemming? Give an example of the kinds of rules used by a Porter stemmer.
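For practice problem 2, a minimal numeric sketch, assuming the model simply tracks the new-word probability as the reciprocal of the growing corpus length (the in-class model may carry more detail than this):

```python
# Under M, P(new word) = 1 / current_corpus_length (from the problem statement).
corpus_length = 100_000   # words in the current corpus
speech_length = 1_000     # words in the upcoming speech

p_new_under_M = 1 / corpus_length                       # new-word probability under M
p_new_under_M_prime = 1 / (corpus_length + speech_length)  # after the speech is added

print(p_new_under_M, p_new_under_M_prime)
```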
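Practice problem 4 can be sanity-checked with a couple of lines of arithmetic, using the standard definitions of precision and recall:

```python
returned = 3000           # results the engine returned
relevant_returned = 500   # returned results that are relevant
total_relevant = 4000     # relevant documents in the indexed corpus

precision = relevant_returned / returned       # fraction of returned results that are relevant
recall = relevant_returned / total_relevant    # fraction of relevant documents that were returned

print(precision, recall)  # -> 0.16666666666666666 0.125
```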
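For practice problem 5, here is one concrete MAP computation; the ranks at which each topic's two relevant documents appear are made up for illustration, which is exactly the freedom the problem grants:

```python
def average_precision(relevant_ranks, num_relevant):
    """AP: mean of precision@k over the ranks k where a relevant doc appears."""
    ranks = sorted(relevant_ranks)
    return sum((i + 1) / r for i, r in enumerate(ranks)) / num_relevant

# Hypothetical ranks at which each topic's two relevant docs are returned.
ranks_per_topic = [[1, 2], [2, 4], [1, 3]]

aps = [average_precision(r, 2) for r in ranks_per_topic]
map_score = sum(aps) / len(aps)  # MAP: mean of the per-topic APs

print(aps, map_score)
```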
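For practice problem 6, a generic galloping (exponential-then-binary) search sketch; the in-class version may differ in bookkeeping details, but the idea of doubling the jump until you overshoot and then binary-searching the bracketed range is the same:

```python
def gallop_next(postings, current):
    """Return the smallest posting strictly greater than `current`,
    or None (often written +infinity) if no such posting exists."""
    n = len(postings)
    if n == 0 or postings[-1] <= current:
        return None
    if postings[0] > current:
        return postings[0]
    # Gallop: double the jump until postings[low + jump] overshoots current.
    low, jump = 0, 1
    while low + jump < n and postings[low + jump] <= current:
        low += jump
        jump *= 2
    high = min(low + jump, n - 1)
    # Binary search in (low, high] for the first posting > current.
    while low < high:
        mid = (low + high) // 2
        if postings[mid] <= current:
            low = mid + 1
        else:
            high = mid
    return postings[low]

postings = [2, 3, 9, 12, 77, 470, 1100, 1400, 2300]
print(gallop_next(postings, 499))  # -> 1100
```

On this list, the gallop steps from offset 0 to 1 to 3, then overshoots at offset 7 (value 1400), and the binary search over offsets 3..7 lands on 1100, the first posting past 499.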
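For the n-gram part of practice problem 10, a sliding-window sketch, assuming no boundary padding (some treatments pad word edges with a marker character, which would add extra grams):

```python
def char_ngrams(word, n):
    """All contiguous substrings of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("caramel", 4))  # -> ['cara', 'aram', 'rame', 'amel']
```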