CS297 Proposal

Full-Text Indexing for Heritrix

Darshan Karia (darshan.karia@students.sjsu.edu)

Advisor: Dr. Chris Pollett


Heritrix is an open-source, web-scale, archive-quality, extensible web crawler project [Heritrix2010]. It stores crawled web content in the ARC file format.

NutchWAX (Nutch + Web Archive eXtensions) can be used to fetch data from web archives created by Heritrix [NutchWAX2009]. WERA can search and navigate archived web documents [WERA2007], provided that a search engine holds a full-text index of those documents and a document retriever acts as an interface between the access module and the web archive [WERAMan2006]. NutchWAX can serve as this interface, retrieving data from the web archive in response to requests from WERA. My project is to provide full-text indexing [FullText2009] for the web archive created by Heritrix.
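As a rough illustration of what a full-text index provides, the Python sketch below builds a small in-memory inverted index and answers conjunctive keyword queries against it. It is only a sketch under stated assumptions: it does not read ARC files or use any Heritrix, NutchWAX, or WERA code, and the function names (tokenize, build_inverted_index, search) and the sample URLs and page text are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical stand-ins for text extracted from archived web documents.
archived_pages = {
    "http://example.org/a": "Heritrix is a web crawler",
    "http://example.org/b": "A crawler writes archive files",
}
index = build_inverted_index(archived_pages)
```

Calling search(index, "web crawler") on this sample returns only the first URL, since it is the only page that contains both terms. A real index over a web archive would also need to handle document versions, term positions, and on-disk storage, which this sketch leaves out.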


Week 1 (Aug. 23-Aug. 27): Write CS 297 Proposal
Week 2 (Aug. 30-Sept. 3): Learn about Heritrix
Week 3 (Sept. 6-Sept. 10): Build Source Code of Heritrix
Week 4 (Sept. 13-Sept. 17): Deliverable 1 due
Week 5 (Sept. 20-Sept. 24): Understand Source Code of Heritrix
Week 6 (Sept. 27-Oct. 1): Modify the Source Code of Heritrix to experiment
Week 7 (Oct. 4-Oct. 8): Deliverable 2 due
Week 8 (Oct. 11-Oct. 15): See Capabilities of NutchWAX
Week 9 (Oct. 18-Oct. 22): See Capabilities of WERA
Week 10 (Oct. 25-Oct. 29): Deliverable 3 due
Week 11 (Nov. 1-Nov. 5): Learn about Full-Text Indexing
Week 12 (Nov. 8-Nov. 12): Learn about Inverted Indexes
Week 13 (Nov. 15-Nov. 19): Deliverable 4 due
Week 14 (Nov. 22-Nov. 26): Write CS 297 Report
Week 15 (Nov. 29-Dec. 3): Write CS 297 Report
Week 16 (Dec. 6-Dec. 10): Deliverable 5 due


The full project will be complete when CS 298 is finished. The following will be done by the end of CS 297:

1. Experiment with Heritrix

2. Modify Source Code of Heritrix

3. See capabilities of NutchWAX and WERA

4. Learn about Full-text indexing and inverted indexes

5. CS 297 Report.


[Heritrix2010] Heritrix Web Crawler - http://crawler.archive.org/. 2010.

[WERA2007] WERA - http://archive-access.sourceforge.net/projects/wera/. 2007.

[NutchWAX2009] NutchWAX - http://archive-access.sourceforge.net/projects/nutch/. 2009.

[WERAMan2006] WERA Manual. Royal Library in Stockholm, Royal Library in Copenhagen, Helsinki University Library in Finland, National Library of Norway, National and University Library of Iceland. 2006.

[FullText2009] He, J., Yan, H., and Suel, T. 2009. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong, China, November 02 - 06, 2009). CIKM '09. ACM, New York, NY, 415-424. DOI= http://doi.acm.org/10.1145/1645953.1646008