HW#1 --- last modified August 28 2023 15:23:03.

Due date: Feb 22

Files to be submitted:
  Hw1.zip

Purpose: To gain some experience with language modeling, to dust off our programming skills, and to start work on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Description:

This homework has three parts: some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will say which file names to use for your work. After you are done with the homework, take all the files you have created, put them into a folder Hw1, zip it to make the file Hw1.zip, and submit that.

The first file I want you to make is the plain text file problems.txt. In it, put your answers to the following questions:

  1. Open Source Shakespeare has stats about Shakespeare's works. For example, there are 884,421 total words in Shakespeare's 43 works. Using this site you can calculate the maximum likelihood estimate for any term in Shakespeare. (a) Compute the MLE for each of the terms: (i) all, (ii) the, (iii) world, (iv) is, (v) stage, (vi) and. Then (b) use these to compute the likelihood of the phrase: All the world is a stage and... (A sketch of this calculation is given after this list.)
  2. Pick a paragraph out of your favorite IMDB movie page. For each term in this paragraph, determine its frequency. Create a log scale plot of frequency versus rank. (A plotting sketch is given after this list.) For such a small sample size, how well is Zipf's Law followed?
  3. Using the transition matrix `M` given in the book for the Markov model in Figure 1.6, compute `(1, 0, 0, 0)M^{3}`, `(1, 0, 0, 0)M^{9}`, ..., `(1, 0, 0, 0)M^{30}` (it might help to use a math tool like Wolfram Alpha or Mathematica, or to write a short program; see the sketch after this list). Do they converge? If so, try to find a math result that might explain why or why not (by searching online for, say, conditions under which PageRank converges, or, if you're a genius, by working it out by hand).
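For problem 1, the MLE of a term is just its collection frequency divided by the total word count, and under a unigram model the phrase likelihood is the product of the individual term estimates. A minimal Python sketch of the calculation; the per-term counts below are placeholders you would replace with the real numbers from Open Source Shakespeare:

    TOTAL_WORDS = 884421   # total words across Shakespeare's 43 works (given above)
    # Placeholder counts -- look up the real per-term counts on the site.
    counts = {"all": 1000, "the": 25000, "world": 500,
              "is": 9000, "stage": 200, "and": 24000}

    mle = {term: count / TOTAL_WORDS for term, count in counts.items()}

    # Unigram likelihood of the phrase: the product of the term MLEs for
    # the six terms listed in part (a).
    likelihood = 1.0
    for term in ["all", "the", "world", "is", "stage", "and"]:
        likelihood *= mle[term]
    print(likelihood)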
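For problem 2, the rank-frequency plot takes only a few lines of Python, assuming matplotlib is installed; the paragraph string below is a stand-in for your IMDB paragraph:

    from collections import Counter
    import matplotlib.pyplot as plt

    paragraph = "paste your IMDB paragraph here"   # placeholder text
    freqs = sorted(Counter(paragraph.lower().split()).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)

    plt.loglog(ranks, freqs, marker="o")   # Zipf's Law predicts roughly a line
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.show()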
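For problem 3, a short program is probably easier than computing the powers by hand. A sketch using numpy; the matrix below is a stand-in, not the actual transition matrix from Figure 1.6, which you should substitute:

    import numpy as np

    # Stand-in row-stochastic matrix -- replace with M from Figure 1.6.
    M = np.array([[0.0, 0.5, 0.5, 0.0],
                  [0.0, 0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.0, 0.5],
                  [0.5, 0.5, 0.0, 0.0]])
    v = np.array([1.0, 0.0, 0.0, 0.0])

    for k in range(1, 31):
        v = v @ M                # v is now (1, 0, 0, 0) M^k
        if k in (3, 9, 30):
            print(k, v)          # check whether the vectors settle down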

For Part 2, I want you to download and install the Yioop search engine using git clone. You will conduct an experiment to compare the different page summarizers available in Yioop. First, pick five Amazon products of your choice. If possible, view one of these products in a language other than English in which you are fluent. For each page, come up with a list of five queries that the page might be able to answer. Make sure your queries only involve words that appear on the given page.

Next, I want you to create by hand a short summary (say five sentences) for each of these product pages. Your summaries should be in the same language as those pages. If you are working in a group of more than one person, have a different person make the queries than the one who makes the summaries. Then, for each of your summaries, look at the list of five queries for the page the summary came from. Determine which queries have all of their terms still in the summary, and record this count for each page. For example, you might find product 1 had 3 out of 5 queries completely contained in the summary. (A small counting helper is sketched at the end of this part.)

Now log in to the root account of your copy of Yioop and go to the Page Options activity. Here you can select between several summarizers for a web page. I want you to conduct your experiments using the BASIC and Centroid Weighted Summarizers. Under Page Options, go to the Test Options tab. For each Amazon product you chose, copy and paste its HTML source into the Test Options text area and click Test Process Page. Look at the output and determine the number of times Yioop got the human language of the Amazon product correct. Then take the summaries Yioop produces for each Amazon product under each of the two summarizers and, as you did for your own human-made summaries, count for each page how many of your queries have all of their terms contained in the summary.

Finally, use the Web Scraper activity in Yioop to make a web scraper for Amazon pages, and see if you can improve your results. Write up all of these experiments in the file summaries.txt, saying which summarizer performed better and how it compared to a human summarizer, and include maybe a paragraph of conclusions.
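For the counting step above, a small helper may save time. A Python sketch, under the assumption that "all query terms in the summary" means a case-insensitive whole-word match:

    def queries_covered(summary, queries):
        """Count how many queries have every one of their terms in the summary."""
        words = set(summary.lower().split())
        return sum(all(term.lower() in words for term in query.split())
                   for query in queries)

    # Example: 2 of the 3 queries are completely contained in the summary.
    print(queries_covered("the quick brown fox jumps",
                          ["quick fox", "brown dog", "the fox"]))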

For Part 3, I want you to code a simple inverted index printer. It will be run from the command line with a line like:

aux_program IndexPrinter folder start_term end_term

Here aux_program is at most one of java, php, or python (your choice). For php or python, IndexPrinter should be IndexPrinter.php or IndexPrinter.py, respectively. If you choose to do this project in C, C++, or Rust, aux_program is optional. Your program will be compiled on a Mac running macOS Big Sur, with the compiler installed using brew. folder is the name of some folder on my hard drive. This folder will contain one or more .txt documents named in numeric order: 1.txt, 2.txt, 3.txt, ... These files will contain only ASCII characters. In these files, a "term" is any sequence of characters that does not include a white-space character.

Once run, your code should read all of these files (they should all fit in RAM) into an in-memory data structure of your choice. Your program should sort all terms seen in any document. For each term in this sorted order, your program should output to stdout all information about terms lexicographically greater than or equal to start_term and less than or equal to end_term, in the following format: first, the term should be output on a line by itself; on the next line, it should output the number of documents the term appeared in, comma, the number of occurrences of the term across all documents, comma, and then a sequence of pairs of the form (doc_num, character position within the document at which the term appears). If start_term was bob and end_term was bongo, and there was one term, bold, in the folder lexicographically between bob and bongo, this output might look like:

bob
2,4,(5,1),(8,6),(8,21),(8,30)
bold
2,2,(2,1),(8,30)
bongo
2,3,(7,4),(8,40),(8,60)

In the above, (5,1) indicates that the term bob occurred once in the fifth document (5.txt), at character offset 1. Once done outputting this data, your program should stop. That is all there is to Part 3. A sketch of one possible approach is given below.
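If you choose Python, a minimal sketch of one way to structure the program follows (the real submission would be IndexPrinter.py). The tokenization and offset convention are assumptions to check against the spec above; in particular, this sketch records 0-based character offsets, so adjust if 1-based offsets are expected:

    import os
    import sys

    def main():
        folder, start_term, end_term = sys.argv[1:4]
        postings = {}   # term -> list of (doc_num, char_offset) pairs, in order
        txt_files = [f for f in os.listdir(folder) if f.endswith(".txt")]
        for name in sorted(txt_files, key=lambda f: int(f[:-4])):
            doc_num = int(name[:-4])
            with open(os.path.join(folder, name)) as f:
                text = f.read()
            offset = 0
            for term in text.split():   # terms are maximal non-whitespace runs
                offset = text.index(term, offset)   # 0-based character position
                postings.setdefault(term, []).append((doc_num, offset))
                offset += len(term)
        for term in sorted(postings):
            if start_term <= term <= end_term:
                pairs = postings[term]
                num_docs = len({doc for doc, _ in pairs})
                row = [str(num_docs), str(len(pairs))]
                row += ["({},{})".format(d, p) for d, p in pairs]
                print(term)
                print(",".join(row))

    if __name__ == "__main__":
        main()

It would be run as, for example, python IndexPrinter.py folder bob bongo, where folder is a stand-in name.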

Point Breakdown
Part 1 (each problem 1pt: 1 if fully correct, 0.5 if any defect or incomplete, 0 if not done or wrong) 3pts
Part 2 summaries.txt tests and write-up (1pt human tests, 1pt Yioop tests (0.5 without the web scraper, 0.5 with it), 1pt overall write-up) 3pts
IndexPrinter creates an inverted index in memory (1pt keeps count of num_docs and frequency for each term, 1pt keeps track of the information needed for each record associated with an entry) 2pts
IndexPrinter outputs the correct rows of the index in the desired format 1pt
'Cause I can't add 1pt
Total 10pts