Deliverable 1: Experiment with Heritrix
Description: Studied Heritrix, built the project from its source code, and ran it to perform sample crawls.
Obtaining Source Code: Check out the Heritrix 1.14.4 source code from the SourceForge SVN repository using the following command: svn co https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/release-branches/heritrix-1.14.4 heritrix-1.14.4
- Heritrix can be built from source using Maven. Note: Do not use Maven 2.x; the build relies on the Maven 1.x maven command.
- Set up the required plugins as described in point 2.2 of the Heritrix Developer's Manual.
- Go to the heritrix-1.14.4 folder that was checked out: cd heritrix-1.14.4
- Run the Maven build: maven dist
- If an error occurs, follow the on-screen instructions and manually download the package that Maven requires.
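- Go to the heritrix-1.14.4/bin directory and run the following command to launch the web interface for Heritrix: heritrix --admin=LOGIN:PASSWORD (use ./heritrix if the current directory is not on your PATH)
- Here, LOGIN=admin and PASSWORD=letmein, so the full command is: heritrix --admin=admin:letmein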
- Launch a web browser and go to the following address to access the web-based user interface for Heritrix: http://127.0.0.1:8080
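
The steps above can be collected into a small shell script so that they are easy to repeat. This is only a sketch under stated assumptions: the function names and the step-by-name dispatch are hypothetical and not part of Heritrix, ./heritrix is used on the assumption that the bin directory is not on PATH, and the final check assumes curl is installed. The plugin setup from the Developer's Manual still has to be done by hand before the build step.

```shell
#!/bin/sh
# Sketch only: wraps the commands from the steps above in functions.
# The function names and the script layout are hypothetical.

HERITRIX_DIR="heritrix-1.14.4"
SVN_URL="https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/release-branches/${HERITRIX_DIR}"
ADMIN_LOGIN="admin"
ADMIN_PASSWORD="letmein"
UI_URL="http://127.0.0.1:8080"

# Check out the release branch into ./heritrix-1.14.4.
checkout_source() {
    svn co "$SVN_URL" "$HERITRIX_DIR"
}

# Build the distribution; `maven` must be on PATH (not Maven 2.x).
build_dist() {
    ( cd "$HERITRIX_DIR" && maven dist )
}

# Print the --admin argument built from the login and password above.
admin_arg() {
    printf '%s' "--admin=${ADMIN_LOGIN}:${ADMIN_PASSWORD}"
}

# Launch the web interface from the bin directory.
# Assumption: ./heritrix is used because bin may not be on PATH.
launch_ui() {
    ( cd "$HERITRIX_DIR/bin" && ./heritrix "$(admin_arg)" )
}

# Assumption: curl is installed. Succeeds if anything answers HTTP at UI_URL.
ui_is_up() {
    curl -s -o /dev/null "$UI_URL"
}

# Run one step by name; with no argument, only the functions are defined.
case "${1:-}" in
    checkout) checkout_source ;;
    build)    build_dist ;;
    launch)   launch_ui ;;
    check)    ui_is_up ;;
esac
```

Saved under a name of your choice (for example heritrix-setup.sh, a hypothetical file name), the script would be run one step at a time: sh heritrix-setup.sh checkout, then build, then launch, then check. Keeping the steps separate makes it easier to fix a failed Maven build by hand before continuing.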