CS297 Proposal
Yioop! Full Historical Indexing in Cache Navigation
Akshat Kukreti (akshat.kukreti@students.sjsu.edu)
Advisor: Dr. Chris Pollett
Description:
Yioop! displays a link to the cached version of each web page when showing the
results of a query. The links within these cached pages currently take the user
to the live versions of the linked pages. The first goal of my project is to add a
feature to Yioop! that lets the user follow links to cached versions of web pages
instead of live ones, chosen according to the time at which the parent page was
crawled. This feature will be similar to the Internet Archive's Wayback Machine.
The user will also be able to do a text search within the cached version. Yioop! is
currently optimized to search a single index at a time. The second goal of the
project is to allow fast searches across multiple indexes. The third goal is to
modify the Yioop! fetchers to handle ETags.
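To make the first goal concrete, the following is a minimal sketch of the kind of link rewriting involved, written in Python for illustration only. The function name rewrite_links, the /cache path, and the url and time query parameters are hypothetical and do not reflect Yioop!'s actual interface. A regular expression is used for brevity; a real implementation would more likely use an HTML parser.

```python
import re
from urllib.parse import urljoin, quote


def rewrite_links(html, base_url, crawl_time, cache_prefix="/cache"):
    """Point each href in a cached page at a cached copy, not the live page.

    cache_prefix and the query parameter names are illustrative assumptions.
    """
    def replace(match):
        quote_char, href = match.group(1), match.group(2)
        # Leave in-page anchors and non-HTTP schemes untouched.
        if href.startswith(("#", "mailto:", "javascript:")):
            return match.group(0)
        # Resolve relative links against the URL of the parent page.
        absolute = urljoin(base_url, href)
        cached = f"{cache_prefix}?url={quote(absolute, safe='')}&time={crawl_time}"
        return f"href={quote_char}{cached}{quote_char}"

    return re.sub(r'href=(["\'])(.*?)\1', replace, html)


page = '<a href="/about.html">About</a> <a href="#top">Top</a>'
print(rewrite_links(page, "http://example.com/index.html", 1347000000))
```

Passing the crawl time of the parent page along with the target URL is what would let the cache lookup pick the copy closest to that time.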
Schedule:
Week | Dates | Task
Week 1 | Sep.4-11 | Read [Kahle 96], [Rackley 09]
Week 2 | Sep.12-18 | Understand the JavaScript used in the Internet Archive
Week 3 | Sep.19-25 | Deliverable 1: Code a script similar to the one read in the previous week and test it
Week 4 | Sep.26-Oct.2 | Read chapter 15 [Buttcher 10] and understand how Yioop! searches across an index
Week 5 | Oct.3-9 | Understand the Yioop! index dictionary
Week 6 | Oct.10-16 | Deliverable 2: Modify the Yioop! index dictionary and test search across multiple indexes
Week 7 | Oct.17-23 | Read chapter 13 [Buttcher 10]
Week 8 | Oct.24-30 | Understand Yioop!'s caching mechanism
Week 9 | Oct.31-Nov.6 | Deliverable 3: Alter links and test modified links for redirection to cached results
Week 10 | Nov.7-13 | Read about ETags
Week 11 | Nov.14-20 | Read and understand Yioop! fetcher code
Week 12 | Nov.21-27 | Deliverable 4: Test modified Yioop! fetcher for handling of ETags
Week 13-14 | Nov.28-Dec.11 | Work on final report
Week 15 | Dec.12-18 | Deliverable 5: CS297 report
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
1. Code a script similar to the JavaScript used in the Internet Archive and
test it.
2. Understand the indexing in Yioop! and make a non-trivial modification
to the index dictionary in Yioop! to enable searching across multiple indexes.
3. Alter links so that they go to cached results within a single index.
4. Modify the Yioop! fetcher to handle ETags and test it.
5. CS297 final report containing a summary of what was done during the
semester, and future work.
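As an illustration of what deliverable 4 involves, the following Python sketch shows the standard HTTP conditional-request pattern built on ETags: the fetcher sends the stored ETag in an If-None-Match header, and a 304 Not Modified response means the cached copy can be kept. The function names and the (etag, body) tuple are assumptions made for this sketch and are not taken from the Yioop! fetcher code.

```python
def conditional_headers(stored_etag):
    """Build request headers asking the server to skip the body if unchanged."""
    return {"If-None-Match": stored_etag} if stored_etag else {}


def handle_response(status, headers, body, cached):
    """Return the (etag, body) pair to keep after a fetch.

    cached is the (etag, body) pair stored from the previous crawl.
    """
    if status == 304:
        # Not Modified: the server confirmed the cached copy is current.
        return cached
    # Otherwise store the new body together with its new ETag, if any.
    return (headers.get("ETag"), body)


previous = ('"abc"', "old page body")
print(conditional_headers(previous[0]))
print(handle_response(304, {}, "", previous))
print(handle_response(200, {"ETag": '"xyz"'}, "new page body", previous))
```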
References:
[Buttcher 10] Information Retrieval: Implementing and Evaluating Search Engines.
Stefan Buttcher, Charles L.A. Clarke, Gordon V. Cormack.
The MIT Press. 2010.
[Kahle 96] Archiving the Internet. Brewster Kahle. Internet Archive. 1996.
[Rackley 09] Internet Archive. Marilyn Rackley. Library, Harvard University,
Cambridge, Massachusetts, U.S.A. 2009.