Full-Text Indexing for Heritrix
Darshan Karia (firstname.lastname@example.org)
Advisor: Dr. Chris Pollett
Heritrix is an open-source, web-scale, archive-quality, extensible web crawler project[Heritrix2010]. It stores crawled web content in the ARC file format.
NutchWAX (Nutch + Web Archive eXtensions) can be used to fetch data from web archives created by Heritrix[NutchWAX2009]. WERA can search and navigate archived web documents[WERA2007], provided that a search engine holds a full-text index of those documents and that a document retriever acts as an interface between the access module and the web archive[WERAMan2006]. NutchWAX can serve as this interface, retrieving data from the web archive in response to requests from WERA. My project is to provide full-text indexing[FullText2009] for the web archive created by Heritrix.
The full project will be done when CS298 is completed. The following will be done by the end of CS297:
1. Experiment with Heritrix
2. Modify the source code of Heritrix
3. Explore the capabilities of NutchWAX and WERA
4. Learn about full-text indexing and inverted indexes
5. Write the CS297 report
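To make the idea of an inverted index concrete, the following is a minimal Python sketch of one, not the design this project will adopt. It maps each term to the documents and word positions where it occurs and answers simple AND queries. The function names, the tokenization rule, and the sample records are all assumptions made for illustration; they do not come from Heritrix, NutchWAX, or WERA.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to a dict of {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the sorted IDs of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # One set of document IDs per query term; a missing term yields an empty set.
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical archived pages keyed by made-up record IDs (illustration only).
archived_pages = {
    "record-1": "Heritrix is an extensible web crawler",
    "record-2": "NutchWAX indexes web archives",
    "record-3": "The crawler stores web pages in archives",
}

index = build_inverted_index(archived_pages)
print(search(index, "web crawler"))   # ['record-1', 'record-3']
print(search(index, "web archives"))  # ['record-2', 'record-3']
```

Storing word positions alongside document IDs is one common choice because it leaves room for phrase queries later; an index that only needs to answer "which documents contain this term" could keep document IDs alone.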
[Heritrix2010] Heritrix Web Crawler - http://crawler.archive.org/. 2010.
[WERA2007] WERA - http://archive-access.sourceforge.net/projects/wera/. 2007.
[NutchWAX2009] NutchWAX - http://archive-access.sourceforge.net/projects/nutch/. 2009.
[WERAMan2006] WERA Manual. Royal Library in Stockholm, Royal Library in Copenhagen, Helsinki University Library in Finland, National Library of Norway, National and University Library of Iceland. 2006.
[FullText2009] He, J., Yan, H., and Suel, T. 2009. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong, China, November 02 - 06, 2009). CIKM '09. ACM, New York, NY, 415-424. DOI= http://doi.acm.org/10.1145/1645953.1646008