Deliverable 3

Deliverable 3 provides implementation details for the Nutch search engine. Nutch is an open-source search engine. For this deliverable, Nutch was set up and run on a Windows Vista 32-bit machine. Working through this setup helped in understanding the various modules that make up a search engine, and that understanding will be useful when implementing the final solution in CS 298.

The Nutch search engine consists of three components:
  1. The Crawler, which discovers and retrieves web pages.
  2. The 'WebDB', a custom database that stores known URLs and fetched page contents.
  3. The 'Indexer', which dissects pages and builds keyword-based indexes from them.
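
To make the division of labor between the three components concrete, the following is a minimal sketch in Python. It is not Nutch code and does not use any Nutch API: the in-memory TOY_WEB dictionary, the WebDB class, and the crawl, build_index, and search functions are all hypothetical names invented for illustration, and the "web" is a hard-coded dictionary so that the example runs without network access.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch these pages over HTTP instead.
TOY_WEB = {
    "http://a.example/": ("Nutch is an open source search engine",
                          ["http://b.example/"]),
    "http://b.example/": ("The crawler fetches pages for the search index",
                          ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("An indexer builds keyword indexes", []),
}


class WebDB:
    """Toy stand-in for the WebDB: known URLs plus fetched page contents."""

    def __init__(self):
        self.known_urls = set()
        self.pages = {}

    def add_url(self, url):
        """Record a URL; return True only if it was not known before."""
        if url in self.known_urls:
            return False
        self.known_urls.add(url)
        return True

    def store(self, url, text):
        """Save the fetched contents of a page."""
        self.pages[url] = text


def crawl(seed, web, db):
    """Crawler: breadth-first discovery and retrieval starting from a seed."""
    db.add_url(seed)
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url not in web:  # page cannot be fetched; skip it
            continue
        text, links = web[url]
        db.store(url, text)
        for link in links:
            if db.add_url(link):  # enqueue only newly discovered URLs
                queue.append(link)


def build_index(db):
    """Indexer: split each stored page into keywords, map keyword -> URLs."""
    index = {}
    for url, text in db.pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(url)
    return index


def search(index, keyword):
    """Look up a single keyword and return the matching URLs, sorted."""
    return sorted(index.get(keyword.lower(), set()))
```

In this sketch, one would create a WebDB, call crawl with a seed URL such as "http://a.example/", pass the populated database to build_index, and then query the result with search. The crawler writes only to the database and the indexer reads only from it, which mirrors the separation of components listed above.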

Please refer to the slides below for further information on the Nutch implementation:

    [Implementation details of NUTCH search engine - PDF]