CS298 Proposal
Full Text Indexing for Heritrix
Darshan Karia (darshan.karia@students.sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Mark Stamp, Dr. Jeff Smith.
Abstract:
My project focuses on indexing archive (ARC) files. I will develop a Java module that indexes these files and, once the module is ready, test its indexing performance on readily available archive files. I will also benchmark the indexer and compare the results with those of various other indexers.
To make the index useful, I will also develop a user interface that accepts a search query, queries the archive through the index, and displays the results. Once the whole project is ready, it can be used to index the archive files generated by large and personalized Heritrix crawls [Heritrix2010].
CS297 Results
- Crawled the web using Heritrix and generated sample archive files.
- Modified the archive file format to include the location of the robots.txt file for each domain.
- Explored the capabilities of NutchWAX [NutchWAX2009] and WERA [WERA2007].
- Learned about full-text indexing and inverted indexes from the book "Information Retrieval" [BCC2010].
Proposed Schedule
Week 1:
Feb.01-Feb.07 | Proposal |
Week 2:
Feb.08-Feb.14 | Work on Index Structure Design |
Week 3:
Feb.15-Feb.21 | Finalize Index Structure Design |
Week 4:
Feb.22-Feb.28 | Deliverable 1 Due: Design an indexing scheme that enables fast retrieval of data from many large archive files |
Week 5:
March.01-March.07 | Initial Development for Indexing Module |
Week 6:
March.08-March.14 | Finalize Development |
Week 7:
March.15-March.21 | Deliverable 2 Due: Develop the Indexing Module, which indexes the archive files and stores an inverted index of the words in each document to support word-based search queries |
Week 8:
March.22-March.28 | Performance testing of the Indexing Module on larger archive files |
Week 9:
March.29-April.03 | Spring Recess |
Week 10:
April.04-April.10 | Deliverable 3 Due: Performance Benchmarking against various other indexers using online resources |
Week 11:
April.11-April.17 | Develop Search Module |
Week 12:
April.18-April.24 | Deliverable 4 Due: Develop the Search Module, which responds to a user's search query and retrieves results from the archive files using the index built by the Indexing Module |
Week 13:
April.25-May.01 | Start Writing Report |
Week 14:
May.02-May.08 | Finalize Report |
Week 15:
May.09-May.15 | Defense. |
Key Deliverables:
- Software
- Deliverable 1: Design an indexing scheme that enables fast retrieval of data from many large archive files
- Deliverable 2: Develop the Indexing Module, which indexes the archive files and stores an inverted index of the words in each document to support word-based search queries
- Deliverable 3: Performance Benchmarking against various other indexers using online resources
- Deliverable 4: Develop the Search Module, which responds to a user's search query and retrieves results from the archive files using the index built by the Indexing Module
- Report
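To illustrate the idea behind Deliverables 2 and 4, here is a minimal in-memory sketch of an inverted index with a conjunctive (AND) word query. It is not the proposed index format: the class name InvertedIndexSketch, the methods addDocument and search, the integer document IDs, and the sample sentences are all hypothetical, and the sketch indexes plain strings rather than records read from ARC files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; the real index would live on disk
// and be built from the documents stored in ARC files.
class InvertedIndexSketch {
    // term -> sorted set of IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Split the text on non-alphanumeric characters and record the
    // document ID in the posting list of every term found.
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND query: return the IDs of documents that contain every query term.
    public List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: copy its postings
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Heritrix writes crawled pages to ARC files");
        index.addDocument(1, "An inverted index maps words to documents");
        index.addDocument(2, "Index the ARC files for full-text search");
        System.out.println(index.search("arc files")); // prints [0, 2]
        System.out.println(index.search("index"));     // prints [1, 2]
    }
}
```

In this sketch the Search Module's job reduces to intersecting posting lists; mapping the returned document IDs back to positions inside the archive files is left out, since that depends on the index structure designed in Deliverable 1.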
Innovations and Challenges
- The main challenge in this project is to come up with an efficient indexing format.
- A further challenge is to design the search interface with JSP and have it communicate with the Java classes that retrieve search results from the archive files.
References:
[Heritrix2010] Heritrix Web Crawler - http://crawler.archive.org/. 2010.
[NutchWAX2009] NutchWAX - http://archive-access.sourceforge.net/projects/nutch/. 2009.
[WERA2007] WERA - http://archive-access.sourceforge.net/projects/wera/. 2007.
[BCC2010] Büttcher, S., Clarke, C. L., & Cormack, G. V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. MIT Press.