CS297 Proposal
Full-Text Indexing for Heritrix
Darshan Karia (darshan.karia@students.sjsu.edu)
Advisor: Dr. Chris Pollett
Description:
Heritrix is an open-source, web-scale, archive-quality, extensible web crawler [Heritrix2010]. It stores crawled web content in the ARC file format.
NutchWAX (Nutch + Web Archive eXtensions) can be used to fetch data from web archives created by Heritrix [NutchWAX2009]. WERA can search and navigate archived web documents [WERA2007], provided that a search engine holds a full-text index of those documents and a document retriever acts as an interface between the access module and the web archive [WERAMan2006]. NutchWAX can serve as this interface, retrieving data from the web archive in response to requests from WERA. My project is to provide full-text indexing [FullText2009] for the web archives created by Heritrix.
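To illustrate what a full-text index over archived documents provides, the following is a minimal sketch of an inverted index with AND-style keyword search. It is not the Heritrix, NutchWAX, or WERA API: the function names (tokenize, build_inverted_index, search) and the sample pages are hypothetical, and the text is assumed to have already been extracted from the archive records.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to a sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return IDs of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical stand-ins for text extracted from archived web documents.
archived_pages = {
    "http://example.org/a": "Heritrix is a web crawler",
    "http://example.org/b": "NutchWAX indexes web archives",
    "http://example.org/c": "WERA searches archived web documents",
}
index = build_inverted_index(archived_pages)
```

In this sketch, a query such as search(index, "web crawler") looks up each term's posting list and intersects them, so only documents containing every query term are returned. A real index for a web archive would also need to record positions, capture dates, and document versions, which this sketch leaves out.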
Schedule:
Week 1: | Aug.23-Aug.27 | Write CS 297 Proposal |
Week 2: | Aug.30-Sept.3 | Learn about Heritrix |
Week 3: | Sept.6-Sept.10 | Build Source Code of Heritrix |
Week 4: | Sept.13-Sept.17 | Deliverable 1 due |
Week 5: | Sept.20-Sept.24 | Understand Source Code of Heritrix |
Week 6: | Sept.27-Oct.1 | Modify the Source Code of Heritrix to experiment |
Week 7: | Oct.4-Oct.8 | Deliverable 2 due |
Week 8: | Oct.11-Oct.15 | See Capabilities of NutchWAX |
Week 9: | Oct.18-Oct.22 | See Capabilities of WERA |
Week 10: | Oct.25-Oct.29 | Deliverable 3 due |
Week 11: | Nov.1-Nov.5 | Learn about Full-Text Indexing |
Week 12: | Nov.8-Nov.12 | Learn about Inverted Indexes |
Week 13: | Nov.15-Nov.19 | Deliverable 4 due |
Week 14: | Nov.22-Nov.26 | Write CS 297 Report |
Week 15: | Nov.29-Dec.3 | Write CS 297 Report |
Week 16: | Dec.6-Dec.10 | Deliverable 5 due |
Deliverables:
The full project will be completed by the end of CS 298. The following will be done by the end of CS 297:
1. Experiment with Heritrix
2. Modify the source code of Heritrix
3. Explore the capabilities of NutchWAX and WERA
4. Learn about full-text indexing and inverted indexes
5. CS 297 report
References:
[Heritrix2010] Heritrix Web Crawler - http://crawler.archive.org/. 2010.
[WERA2007] WERA - http://archive-access.sourceforge.net/projects/wera/. 2007.
[NutchWAX2009] NutchWAX - http://archive-access.sourceforge.net/projects/nutch/. 2009.
[WERAMan2006] WERA Manual. Royal Library in Stockholm, Royal Library in Copenhagen, Helsinki University Library in Finland, National Library of Norway, National and University Library of Iceland. 2006.
[FullText2009] He, J., Yan, H., and Suel, T. 2009. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong, China, November 02 - 06, 2009). CIKM '09. ACM, New York, NY, 415-424. DOI= http://doi.acm.org/10.1145/1645953.1646008