Chris Pollett > CS267
HW#1 --- last modified February 14 2022 22:52:08.

Due date: Feb 14

Files to be submitted:

Purpose: To gain some experience with language modeling, to dust off our programming skills, and to start work on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification: This homework has three parts: some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will say the file names of where I want you to put stuff. After you are done with the homework, take all the files you have created, put them into a folder Hw1, zip it to make a file Hw1.zip, and submit that. Make sure this folder is smaller than 10MB so it can be uploaded to my homework submission system. The first part consists of the following exercises, which should be included in a file Hw1.pdf.
For the second part of the homework, I want you to download and install Yioop as described in class. Consider the web pages: the Amazon landing page, the Rotten Tomatoes (Movie Reviews) landing page, and The Sweet (Glam Rock Band) Wikipedia page. Without looking at these pages, identify three information needs you think each page would provide, in descending order of importance. For each of these information needs, come up with a query you might type into a search engine to obtain this information. Next, look at each of these pages, both normally in a browser and by viewing the source HTML. For each page, make by hand a five-sentence summary of the contents of the page using only phrases appearing on the page. For each of the queries you made, calculate by hand the number of the summaries you created that contained all the query words.

Log in to the root account of your copy of Yioop and go to the Page Options activity. Here you can select between several summarizers for a web crawl. I want you to conduct your experiments using the BASIC and Centroid Weighted Summarizers. Under Page Options, go to the Test Options tab, choose by URI as your method of testing, and then test each of these web pages. By looking at the output of these tests, determine what Yioop thought were the top 5 sentences for each page (look at both the description and the description scores to figure this out). Then, for Yioop's five-sentence summaries, determine for each query which summaries had all the query terms. How did Yioop's summarizers compare with the human results? Write your analysis in summaries.pdf and include this in your Hw1.zip file.

For Part 3, I want you to code a simple inverted index printer for character n-grams. You have some flexibility in the choice of the language you use to code this project (see below).
It will be run from the command line with a line like:

aux_program CharGramIndexPrinter corpus_file n

Here aux_program is either empty (if you decide to use C, C++, or Rust) or java, node, php, python (your choice if you decide to code using one of these languages). For node, php, or python, CharGramIndexPrinter should be CharGramIndexPrinter.js, CharGramIndexPrinter.php, or CharGramIndexPrinter.py, respectively. Your program will be compiled on a Mac running MacOS Monterey, having used brew to install the compiler/interpreter. An example input line I might use in running the program is:

php CharGramIndexPrinter.php some-corpus.txt 4

The corpus_file is assumed to be an ASCII text file in which \n is used to indicate a newline. Within it, a line beginning with --- (three hyphens) indicates the start of a new document. A "term" within a document is any sequence of `n` characters, where `n` comes from the command line. All characters are converted to lower case before processing. Each whitespace character is converted to an underscore _, and each document is prepended and appended with `n-1` underscores before processing. So, for example, if some-corpus.txt was:

The quick or the brown fox
---
jumped over sally
---
and the brilliant bob fox
---
but not the brown peter bunny

then the first document before processing would be converted to the string:

___the_quick_or_the_brown_fox___

The character 4-grams in this document would be:

___t __th _the the_ he_q e_qu _qui quic uick ick_ ck_o k_or _or_ or_t r_th _the the_ he_b e_br _bro brow rown own_ wn_f n_fo _fox fox_ ox__ x___

The corpus would be considered to have four documents: (1) The quick or the brown fox, (2) jumped over sally, (3) and the brilliant bob fox, (4) but not the brown peter bunny. When run, your program should sort all terms seen in any document. For each term in this sorted order, your program should output to stdout the term on a line by itself.
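The preprocessing and n-gram extraction described above can be sketched as follows. This is a minimal illustration, not the required solution; the helper name char_ngrams is my own choice.

```python
import re

def char_ngrams(doc, n):
    """Lowercase the document, map each whitespace character to an
    underscore, pad both ends with n-1 underscores, then slide a
    window of length n across the result."""
    text = re.sub(r"\s", "_", doc.lower())
    padded = "_" * (n - 1) + text + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("The quick or the brown fox", 4)
print(grams[0], grams[-1], len(grams))  # ___t x___ 29
```

On the first sample document this yields the 29 four-grams listed above, from ___t through x___.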
On the next line it should output the number of documents the term appeared in, a comma, the number of occurrences of the term across all documents, a comma, and then a sequence of pairs of the form (doc_num, character position within the document at which the term appears). For example, the entry for _jum might look like:

_jum
1,1,(1,2)

Notice the first document is treated as document 0, the second as 1; similarly, the first position within a document is treated as 0, the next as 1, etc. Once done outputting this data for every term, your program should stop. That is all there is to Part 3.

Point Breakdown
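The whole flow, from splitting the corpus on --- lines to printing sorted posting entries, can be sketched as below. This is a hedged illustration under my own assumptions, not the required interface: the function name index_lines, taking the corpus as an in-memory string, and separating multiple (doc,pos) pairs with spaces (the spec's example shows only a single pair) are all my choices.

```python
import re
from collections import defaultdict

def index_lines(corpus_text, n):
    """Return the index printout as a list of lines: each term in sorted
    order, followed by doc_freq,total_count,(doc,pos) pairs."""
    # Split the corpus into documents on lines beginning with ---
    docs, cur = [], []
    for line in corpus_text.split("\n"):
        if line.startswith("---"):
            docs.append("\n".join(cur))
            cur = []
        else:
            cur.append(line)
    docs.append("\n".join(cur))

    # Record every (doc_num, position) occurrence of every n-gram term
    postings = defaultdict(list)
    pad = "_" * (n - 1)
    for doc_num, doc in enumerate(docs):
        text = pad + re.sub(r"\s", "_", doc.strip().lower()) + pad
        for pos in range(len(text) - n + 1):
            postings[text[pos:pos + n]].append((doc_num, pos))

    lines = []
    for term in sorted(postings):
        pairs = postings[term]
        doc_freq = len({d for d, _ in pairs})
        pair_str = " ".join(f"({d},{p})" for d, p in pairs)
        lines.append(term)
        lines.append(f"{doc_freq},{len(pairs)},{pair_str}")
    return lines

if __name__ == "__main__":
    corpus = ("The quick or the brown fox\n---\njumped over sally\n---\n"
              "and the brilliant bob fox\n---\nbut not the brown peter bunny")
    print("\n".join(index_lines(corpus, 4)))
```

Run on the sample corpus, the two lines for _jum come out as _jum followed by 1,1,(1,2), matching the example entry above.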