HW#1 --- last modified February 10 2019 14:00:01.

Due date: Sep 12

Files to be submitted:
Hw1.zip

Purpose: To dust off our programming skills and start working on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification:

There are three parts to this assignment.

For Part 1, I would like you to do Exercises 1.1 and 1.3 out of the book. Write up your results in a file Hw1Part2.txt, which you should include in the zip file you submit.

For Part 2, I want you to download Yioop! or Nutch and have it crawl the site mirror.cs.sjsu.edu. Perform some simple queries on the results of your crawl. In the file Hw1Part2.txt, describe how you configured Yioop! or Nutch. Look in the folder in which the data was downloaded. In your write-up, list the name of this folder and the files you see, and give a short explanation of each file. Finally, include the results of your queries.

For the last part of the homework, I want you to code a simple Java, C/C++, or PHP program that indexes all the files in a folder. Your code should come with a README.txt file which indicates how to compile and run it. You can assume I have available ant, javac, gcc, and Version 5.3 of PHP. No other compilers or interpreters will be supported. Don't assume I have installed anything more than the default libraries -- if it doesn't compile, you are outta luck, so please keep your code as vanilla as possible.

Your program should run from the command line with the name of the folder to index as a command-line argument. Once run, your code should read each file in that folder into an in-memory data structure of your choice. We assume for now that the files in this folder are all .txt files and are small enough that all of them fit in memory together. Your program should sort all terms seen in any document. For each term, in sorted order, your program should output to stdout the term on a line by itself. On the next line it should output the number of documents the term appeared in, a comma, the number of occurrences of the term across all documents, a comma, and then a comma-separated sequence of pairs of the form (doc_name, # of occurrences within this doc). A given pair appears iff that doc contained the term at least once.
A snippet of this table might look like:

aardvark
2,4,(some_file.txt,1),(another_file.txt,3)
aaron
1,1,(yet_another_file.txt,1)
...

Once done outputting this data your program should stop. That is all there is to Part 3.
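
To make the output format concrete, here is a minimal Java sketch of one way the Part 3 statistics could be computed and printed. It is an illustration under stated assumptions, not a reference solution. The class name IndexSketch and the method names buildIndex and render are made up for this example. The tokenization rule (lowercase, split on anything that is not a letter or digit) is also an assumption, since the specification above does not say how terms are extracted. Pairs within a line are listed in doc-name order here, which the specification does not require.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; the assignment does not prescribe one.
class IndexSketch {

    // Builds term -> (doc name -> occurrences in that doc).
    // The outer TreeMap keeps the terms in sorted order.
    static TreeMap<String, Map<String, Integer>> buildIndex(Map<String, String> docs) {
        TreeMap<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Assumed tokenization: lowercase, split on runs of non-alphanumerics.
            String[] terms = doc.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue; // split() can yield an empty leading token
                }
                index.computeIfAbsent(term, t -> new LinkedHashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Formats the index as: term, newline, then docCount,totalCount,(doc,n),...
    static String render(TreeMap<String, Map<String, Integer>> index) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, Integer>> entry : index.entrySet()) {
            int total = 0;
            StringBuilder pairs = new StringBuilder();
            for (Map.Entry<String, Integer> p : entry.getValue().entrySet()) {
                total += p.getValue();
                pairs.append(",(").append(p.getKey()).append(',')
                     .append(p.getValue()).append(')');
            }
            out.append(entry.getKey()).append('\n')
               .append(entry.getValue().size()).append(',').append(total)
               .append(pairs).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java IndexSketch <folder>");
            System.exit(1);
        }
        // Read every .txt file in the folder into memory, keyed by file name.
        Map<String, String> docs = new TreeMap<>();
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(Paths.get(args[0]), "*.txt")) {
            for (Path f : files) {
                docs.put(f.getFileName().toString(),
                         new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
            }
        }
        System.out.print(render(buildIndex(docs)));
    }
}
```

Compiled with javac IndexSketch.java, this would be run as java IndexSketch some_folder. Nesting a per-document count map inside a sorted map keyed by term means the document count, the total count, and the pairs for a term can all be read off a single entry when printing.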

Point Breakdown

Part 1 (each problem 2pts: 2 fully correct, 1 any defect or incomplete, 0 didn't do or was wrong)	4pts
Part 2 (0.5pt test queries seem believable, 0.5 configuration write-up, 1 description of downloaded files).	2pts
Part 3 (1pt code well documented and follows departmental coding guidelines for language in question or PEAR guidelines for PHP, 1pt reads all the files in the folder into a reasonable data structure, 1pt seems to be calculating the desired statistics, 1pt output as described above).	4pts
Total	10pts