Chris Pollett > Students >
Darshan

Deliverable 2: Modify Source Code of Heritrix

Description: Modified the Heritrix source code so that, for every file it stores, the crawler also records the address of the corresponding robots.txt.

Understand Source Code: Studied the Heritrix source code to locate the exact place where a modification would produce the desired change in Heritrix's behavior.

To achieve the desired change in the Heritrix ARC file, I added the following code to the write method of the ARCWriterProcessor class, located in the org.archive.crawler.writer package. When the metadata for a web page is stored, this code finds the offset of that site's robots.txt so that it can be included:

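// Fragment of ARCWriterProcessor.write(). curi, position, flag, and robotTable
// are declared outside this fragment; robotTable is assumed to map String to Long.
String[] temp = curi.toString().split("/");
String tmp = "";
long robotPosition = 0;
flag = false;
// Build the site prefix "scheme://host/" from the first three pieces of the URI.
// The bound is temp.length, not temp.length - 1, so that a URI with no path
// (for example http://host) keeps its host.
for(int i = 0; i < temp.length && i < 3; i++)
{
	tmp += temp[i] + "/";
}
// A robots.txt record: remember the offset at which it is written.
if(temp[temp.length-1].equalsIgnoreCase("robots.txt"))
{
	flag = true;
	robotTable.put(tmp, Long.valueOf(position));
}
System.out.println("position: " + position);
// Any other record: look up the robots.txt offset stored for the same site.
if(!flag){
	for(String key: robotTable.keySet())
	{
		System.out.println("key: " + key + " vs " + tmp + "\n" + robotTable.get(key));
		if(tmp.equalsIgnoreCase(key))
		{
			System.out.println("Robot: " + robotTable.get(key));
			robotPosition = robotTable.get(key);
		}
	}
}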
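
The lookup above can also be tried outside Heritrix. The following is a minimal, self-contained sketch of the same idea, not the code that was added to ARCWriterProcessor: the class name RobotsOffsetIndex and the methods sitePrefix and record are hypothetical, and the case-insensitive key comparison is replaced here by lower-casing the site prefix before it is used as a map key.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Heritrix): site prefix -> offset of its robots.txt record. */
class RobotsOffsetIndex {
    // Key: lower-cased "scheme://host/"; value: offset of that site's robots.txt record.
    private final Map<String, Long> robotTable = new HashMap<>();

    /** Builds "scheme://host/" from the first three "/"-separated pieces of a URI. */
    static String sitePrefix(String uri) {
        String[] parts = uri.split("/");
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length && i < 3; i++) {
            prefix.append(parts[i]).append('/');
        }
        return prefix.toString().toLowerCase(Locale.ROOT);
    }

    /**
     * Call once per record, with the record's URI and the offset it is written at.
     * Returns the offset of the site's robots.txt record, or 0 when none has been
     * seen yet or when the record is itself a robots.txt.
     */
    long record(String uri, long position) {
        String[] parts = uri.split("/");
        String prefix = sitePrefix(uri);
        boolean isRobots = parts.length > 0
                && parts[parts.length - 1].equalsIgnoreCase("robots.txt");
        if (isRobots) {
            robotTable.put(prefix, position);
            return 0L;
        }
        return robotTable.getOrDefault(prefix, 0L);
    }

    public static void main(String[] args) {
        RobotsOffsetIndex index = new RobotsOffsetIndex();
        index.record("http://example.com/robots.txt", 1024L);
        // Prints the offset recorded for example.com's robots.txt.
        System.out.println(index.record("http://example.com/docs/page.html", 4096L));
    }
}
```

As in the fragment above, a result of 0 is ambiguous: it can mean that no robots.txt has been seen for the site, or that the robots.txt record really sits at offset 0. A sentinel such as -1 would remove the ambiguity if the distinction matters.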