CS297 Proposal

Full-Text Indexing for Heritrix

Darshan Karia (darshan.karia@students.sjsu.edu)

Advisor: Dr. Chris Pollett

Description: Heritrix is an open-source, extensible, web-scale, archive-quality web crawler project [Heritrix2010]. It stores crawl results in the ARC file format. NutchWAX (Nutch + Web Archive eXtensions) can be used to fetch data from the web archives that Heritrix creates [NutchWAX2009]. WERA can search and navigate archived web documents [WERA2007], provided that a search engine holds a full-text index of those documents and a document retriever acts as an interface between the access module and the web archive [WERAMan2006]. NutchWAX can serve as this interface, retrieving data from the web archive in response to requests from WERA. My project is to provide full-text indexing [FullText2009] for the web archives created by Heritrix.

Schedule:
Deliverables: The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Experiment with Heritrix
2. Modify the source code of Heritrix
3. Explore the capabilities of NutchWAX and WERA
4. Learn about full-text indexing and inverted indexes
5. CS297 report

References:

[Heritrix2010] Heritrix Web Crawler. http://crawler.archive.org/. 2010.

[WERA2007] WERA. http://archive-access.sourceforge.net/projects/wera/. 2007.

[NutchWAX2009] NutchWAX. http://archive-access.sourceforge.net/projects/nutch/. 2009.

[WERAMan2006] WERA Manual. Royal Library in Stockholm, Royal Library in Copenhagen, Helsinki University Library in Finland, National Library of Norway, National and University Library of Iceland. 2006.

[FullText2009] He, J., Yan, H., and Suel, T. 2009. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong, China, November 02 - 06, 2009). CIKM '09. ACM, New York, NY, 415-424. DOI: http://doi.acm.org/10.1145/1645953.1646008
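
Illustrative note on deliverable 4: the inverted index named there can be sketched in a few lines of Python. This is only a minimal sketch of the general idea, not the design that this project or NutchWAX uses. The function names (tokenize, build_inverted_index, search) and the sample documents are hypothetical, and the texts stand in for content that would be extracted from archived pages.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical stand-ins for text extracted from archived pages.
docs = {
    "page1": "Heritrix is an open-source web crawler",
    "page2": "NutchWAX indexes web archives",
    "page3": "WERA searches archived web documents",
}
index = build_inverted_index(docs)
```

Calling search(index, "web crawler") intersects the posting lists for "web" and "crawler". A full-text index over a real archive would also need term positions, ranking, and on-disk storage, none of which this sketch attempts.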