Student Corner:
[Final-PDF]

[Submit Sec1]

[Lecture Notes]
[Discussion Board]

Course Info:
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

Due date: Feb 22

Files to be submitted:
Hw1.zip

Purpose: To gain some experience with language modeling, to dust off our programming skills, and to start work on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Description:

This homework has three parts, some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will say file names of where I want you to put stuff. After you are done the homework take all the files you have created put them into a folder Hw1, zip it to make a file Hw1.zip, and submit that.

The first file I want you to make is the plain text file problems.txt. In it put your answers to the following questions:

1. Open Source Shakespeare has stats about Shakespeare's works. For example, there are 884,421 total words in Shakespeare's 43 works. Using this site you can calculate the maximum likelihood estimate for any term in Shakespeare. (a) Compute the MLE for each of the terms: (i) all (ii) the (iii) world, (i) is, (v) stage, (vi) and. Then (b) compute using these the likelihood of the phrase: All the world is a stage and...
2. Pick a paragraph out of your favorite IMDB movie page. For each term in this paragraph determine its frequency. Create a log scale plot of frequency versus rank. For this small a sample size, how well is Zipf's Law followed?
3. Using the transition matrix M given in the book for the Markov model in Figure 1.6, compute (1, 0, 0, 0)M^{3}, (1, 0, 0, 0) M^{9}, ..., (1, 0, 0, 0)M^{30} (might help to use a Math tool like Wolfram alpha, Mathematica or write a short program). Do they converge? If so, try to find a math result (by searching around online on say conditions for page rank converging or if you're a genius by hand) which might explain why or why not.

For Part 3, I want you to code a simple inverted index printer. It will be run from the command line with a line like:

aux_program IndexPrinter folder start_term end_term


Here aux_program is at most one of java, php, python, (your choice). For php or python, IndexPrinter should be either IndexPrinter.php or IndexPrinter.py respectively If you choose to do this project in C, C++, or Rust, aux_program is optional. Your program will be compiled on a Mac running MacOS Big Sur, having used brew to install the compiler. folder is the name of some folder on my hard drive. This folder will contain one or more .txt documents named in numeric order 1.txt, 2.txt, 3.txt ... These files will contain only ASCII characters. In these files, a "term" is any sequence of characters that does not include a white-space character. Once run, your code should read in all of these files (they should all fit in RAM) into a data structure of your choice in memory. Your program should sort all terms seen in any document. For each term in this sorted order, your program should output to stdout, all information about terms lexicographically equal to or greater than start_term, and less than or equal to end_term in the following format. First, the term should be output on a line by itself. On the next line it should output the number of documents the term appeared in, comma, the number of occurrences of the term across all documents, comma, and then a list a sequence of pairs in the form (doc_num, character position within the document that the term appears at). If start_term was bob and end term was bongo, and there was one term, bold, in the folder lexicographically between bob and bongo, this table might look like:

bob
2,4,(5,1),(8,6),(8,21),(8,30)
bold
2,2,(2,1),(8,30)
bongo
2,3,(7,4),(8,40),(8,60)


In the above (5,1) indicates the term a bob occurred once in the fifth document (5.txt) and it appeared at character offset 1. Once done outputting this data your program should stop. That is all there is to Part 3.

 Part 1 (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) 3pts Part 2 summaries.txt tests and write up (1pt human tests, 1pt Yioop tests (0.5 no web scraper, 0.5 scraper), 1pt overall write-up 3pts IndexPrinter creates an inverted index in memory (1pt keeps count of num_docs and frequency for each term, 1pt keeps track of information needed for each record associated with an entry). 2pts IndexPrinter outputs the correct rows of the index in the desired format 1pt Cause I can't add 1pt Total 10pts