CS298 Proposal

Full Text Indexing for Heritrix

Darshan Karia (darshan.karia@students.sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Mark Stamp, Dr. Jeff Smith.

Abstract:

My project focuses on indexing archive (ARC) files. I will develop a Java module that indexes these files and, once the module is ready, test its indexing performance on readily available archive files. I will also benchmark the indexer and compare the results with those of other indexers.

To make the index useful, I will also develop a user interface that accepts search queries, evaluates them against the index, and displays the matching results from the archive. Once complete, the project can be used to index the archive files generated by large or personalized Heritrix crawls [Heritrix2010].

CS297 Results

  • Crawled the web using Heritrix and generated sample archive files.
  • Modified the archive file format to include the location of the robots.txt file for the corresponding domain.
  • Explored the capabilities of NutchWAX [NutchWAX2009] and WERA [WERA2007].
  • Learned about full-text indexing and inverted indexes from the book "Information Retrieval" [BCC2010].
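
As background for the archive files mentioned above, the following is a minimal sketch of parsing the one-line header that precedes each record in a version 1 ARC file. It assumes the standard five space-separated fields (URL, IP address, archive date, content type, length) and a content type without embedded spaces; it does not cover the modified format with the robots.txt location. The class and method names (ArcHeaderSketch, ArcHeader, parse) are hypothetical and are not part of Heritrix.

```java
public class ArcHeaderSketch {

    /** Fields of one version 1 ARC URL-record header line (assumed layout). */
    static final class ArcHeader {
        final String url;
        final String ipAddress;
        final String archiveDate;  // timestamp string, e.g. 20110201120000
        final String contentType;
        final long length;         // number of bytes in the record body

        ArcHeader(String url, String ipAddress, String archiveDate,
                  String contentType, long length) {
            this.url = url;
            this.ipAddress = ipAddress;
            this.archiveDate = archiveDate;
            this.contentType = contentType;
            this.length = length;
        }
    }

    /** Splits a header line into its five fields; rejects any other field count. */
    static ArcHeader parse(String line) {
        String[] parts = line.trim().split(" ");
        if (parts.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 fields but found " + parts.length);
        }
        return new ArcHeader(parts[0], parts[1], parts[2], parts[3],
                             Long.parseLong(parts[4]));
    }

    public static void main(String[] args) {
        // Made-up header line, used only for illustration.
        ArcHeader h = parse(
            "http://www.example.com/index.html 192.0.2.1 20110201120000 text/html 202");
        System.out.println(h.url + " (" + h.contentType + ", " + h.length + " bytes)");
    }
}
```

An indexer would typically use the length field to read exactly that many bytes of record body before looking for the next header.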

Proposed Schedule

Week 1 (Feb. 01-Feb. 07): Proposal
Week 2 (Feb. 08-Feb. 14): Work on Index Structure Design
Week 3 (Feb. 15-Feb. 21): Finalize Index Structure Design
Week 4 (Feb. 22-Feb. 28): Deliverable 1 Due: Design an indexing scheme that supports fast retrieval of data from many large archive files
Week 5 (March 01-March 07): Initial Development for Indexing Module
Week 6 (March 08-March 14): Finalize Development
Week 7 (March 15-March 21): Deliverable 2 Due: Develop the indexing module, which indexes archive files and stores an inverted index of the words in their documents to support word-based search queries
Week 8 (March 22-March 28): Do Performance Testing with larger files to index
Week 9 (March 29-April 03): Spring Recess
Week 10 (April 04-April 10): Deliverable 3 Due: Benchmark the indexer's performance against other indexers, using online resources
Week 11 (April 11-April 17): Develop Search Module
Week 12 (April 18-April 24): Deliverable 4 Due: Develop the search module, which responds to a user's search query by retrieving results from the archive files using the index built by the indexing module
Week 13 (April 25-May 01): Start Writing Report
Week 14 (May 02-May 08): Finalize Report
Week 15 (May 09-May 15): Defense

Key Deliverables:

  • Software
    • Deliverable 1: Design an indexing scheme that supports fast retrieval of data from many large archive files
    • Deliverable 2: Develop the indexing module, which indexes archive files and stores an inverted index of the words in their documents to support word-based search queries
    • Deliverable 3: Benchmark the indexer's performance against other indexers, using online resources
    • Deliverable 4: Develop the search module, which responds to a user's search query by retrieving results from the archive files using the index built by the indexing module
  • Report
    • CS298 Report
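
To illustrate the kind of structure Deliverable 2 refers to, here is a minimal in-memory sketch of an inverted index that maps each word to the list of documents containing it. This is only an illustration, not the index format proposed for the project: the class name InvertedIndexSketch, the integer document IDs, and the simple lowercase alphanumeric tokenizer are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // term -> ascending list of IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /**
     * Tokenizes the text and records docId under every term it contains.
     * Documents are assumed to be added in increasing docId order, which
     * keeps each postings list sorted without an explicit sort.
     */
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    /** Returns the postings list for a term, or an empty list if the term is unknown. */
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Heritrix writes archive files");
        index.addDocument(1, "Indexing archive files with an inverted index");
        System.out.println("archive -> " + index.lookup("archive"));
        System.out.println("heritrix -> " + index.lookup("Heritrix"));
    }
}
```

An index for many large archive files would need to live on disk and would likely store more per posting (for example, the archive file and offset of each document), which is where the design work of Deliverable 1 comes in.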

Innovations and Challenges

  • The main challenge in this project is to design an efficient index format.
  • A further challenge is to build the search interface with JSP and have it communicate with the Java classes that retrieve search results from the archive files.
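
As a sketch of the kind of Java logic a search page might call, the example below answers a two-word query by intersecting two sorted lists of document IDs, a standard way to evaluate an AND query over an inverted index. The class name PostingsIntersectionSketch, the intersect method, and the sample lists are hypothetical; the proposal does not specify how queries will be evaluated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PostingsIntersectionSketch {

    /**
     * Returns the document IDs present in both lists.
     * Both inputs must be sorted in ascending order without duplicates;
     * the merge then takes time linear in the combined list length.
     */
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                result.add(x);  // document contains both terms
                i++;
                j++;
            } else if (x < y) {
                i++;            // advance the list with the smaller ID
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up postings lists for two query terms.
        List<Integer> web = Arrays.asList(1, 3, 4, 8, 10);
        List<Integer> crawler = Arrays.asList(2, 3, 8, 9);
        System.out.println(intersect(web, crawler));
    }
}
```

Keeping postings lists sorted by document ID is what makes this single-pass merge possible, so the choice of index format directly affects how cheaply queries can be answered.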

References:

[Heritrix2010] Heritrix Web Crawler - http://crawler.archive.org/. 2010.

[NutchWAX2009] NutchWAX - http://archive-access.sourceforge.net/projects/nutch/. 2009.

[WERA2007] WERA - http://archive-access.sourceforge.net/projects/wera/. 2007.

[BCC2010] Büttcher, S., Clarke, C. L., & Cormack, G. V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. MIT Press.