HW#1 --- last modified February 10 2019 22:00:01.
Solution set.
Due date: Sep 12
Files to be submitted:
Hw1.zip
Purpose: To dust off our programming
skills and start working on coding an inverted index. To gain experience using
an open source crawler. To practice evaluating search results.
Related Course Outcomes:
The main course outcomes covered by this assignment are:
LO1 - Code a basic inverted index capable of performing conjunctive queries.
LO6 - Be able to evaluate search results by hand and using TREC eval
software.
Specification:
There are three parts to this assignment. For Part 1, I would like you to do Exercises 1.1 and
1.3 out of the book. Write up your results in a file Hw1Part1.txt, which you should include in
the zip file you submit. For Part 2, I want you to download Yioop! or Nutch and have it crawl the
site mirror.cs.sjsu.edu. Perform some simple queries on the results of your crawl. In the file
Hw1Part2.txt, describe how you configured Yioop! or Nutch. Look in the folder in which the data was
downloaded. In your write-up, list the name of this folder and the files you see. Give a short
explanation of each file. Finally, include the results of your queries. For the last part of the homework,
I want you to code a simple Java, C/C++, or PHP program that indexes all the files in a folder.
Your code should come with a README.txt file which indicates how to compile it and run it. You can assume
I have available ant, javac, gcc, and Version 5.3 of PHP. No other compilers or interpreters will be supported.
Don't assume I have installed anything more than the default libraries -- if it doesn't compile, you are outta
luck, so please keep your code as vanilla as possible. Your program should run
from the command line with the folder name to index as a command-line argument. Once run, your code should read each
file in that folder into a data structure of your choice in memory. We assume for now that the files in this folder
are all .txt files and are small enough that all of them fit in memory. Your program should sort all
terms seen in any document. For each term in this sorted order, your program should output to stdout the term on a line by
itself. On the next line it should output the number of documents the term appeared in, a comma, the number of occurrences of
the term across all documents, a comma, and then a comma-separated sequence of pairs of the form (doc_name, # of occurrences within this doc). Here
a given pair appears iff that doc contained the term at least once. A snippet of this table might look like:
aadvark
2,4,(some_file.txt,1),(another_file.txt,3)
aaron
1,1,(yet_another_file.txt,1)
...
Once done outputting this data your program should stop. That is all there is to Part 3.
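To make the Part 3 output format concrete, here is a minimal Java sketch of one possible approach. It is not a reference solution. Several details are assumptions that the specification leaves open: the class name SimpleIndexer and its helper methods are hypothetical, a term is taken to be a lowercased run of ASCII letters and digits, pairs within a line are ordered by document name, files are read as UTF-8, and a Java 8 or later javac is assumed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SimpleIndexer {
    /** Adds one document's terms to the index (term -> doc name -> count). */
    static void addDocument(Map<String, Map<String, Integer>> index,
                            String docName, String text) {
        // Assumption: a term is a maximal run of ASCII letters or digits,
        // lowercased. The assignment does not define tokenization.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty token before a leading separator
            }
            index.computeIfAbsent(term, t -> new TreeMap<>())
                 .merge(docName, 1, Integer::sum);
        }
    }

    /** Renders the index as two lines per term, in sorted term order. */
    static String format(Map<String, Map<String, Integer>> index) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, Integer>> entry : index.entrySet()) {
            Map<String, Integer> postings = entry.getValue();
            int total = 0;
            for (int count : postings.values()) {
                total += count;
            }
            out.append(entry.getKey()).append('\n');
            // Number of documents, then occurrences across all documents.
            out.append(postings.size()).append(',').append(total);
            for (Map.Entry<String, Integer> p : postings.entrySet()) {
                out.append(",(").append(p.getKey()).append(',')
                   .append(p.getValue()).append(')');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SimpleIndexer <folder>");
            System.exit(1);
        }
        // A TreeMap keeps terms sorted, so no separate sort step is needed.
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(Paths.get(args[0]), "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                addDocument(index, file.getFileName().toString(), text);
            }
        }
        System.out.print(format(index));
    }
}
```

Saved as SimpleIndexer.java, this could be compiled with javac SimpleIndexer.java and run with java SimpleIndexer followed by the folder name. Nesting a sorted map inside a sorted map keeps the output deterministic, at the cost of ordering each term's pairs alphabetically by file name, which the specification does not require.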
Point Breakdown
Part 1 (each problem 2pts - 2 fully correct, 1 any defect or incomplete, 0 didn't do or was wrong) | 4pts
Part 2 (0.5pt test queries seem believable, 0.5pt configuration write-up, 1pt description of downloaded files) | 2pts
Part 3 (1pt code well documented and follows departmental coding guidelines for language in question or PEAR guidelines for PHP, 1pt reads all the files in the folder into a reasonable data structure, 1pt seems to be calculating the desired statistics, 1pt output as described above) | 4pts
Total | 10pts