Chris Pollett >
Students > [Bio] [CS298 Spring 2011 - Progress Report] |
Deliverable 2: Modify Source Code of Heritrix

Description: Modified the source code of Heritrix to store the address of robots.txt for all files.

Understand Source Code: I studied the Heritrix source code so that I could find the exact place to modify in order to change Heritrix's behavior as desired.

To achieve the desired change in the Heritrix ARC file, I added the following code to the write method of the ARCWriterProcessor class, located in the org.archive.crawler.writer package. When storing the metadata for a web page, this code includes the offset of that site's robots.txt:
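// flag, position, and robotTable are assumed to be declared elsewhere in ARCWriterProcessor (not shown).
// Build the site prefix (e.g. "http://host/") from the leading components of the URI.
String[] temp = curi.toString().split("/");
String tmp = "";
long robotPosition = 0;
flag = false;
for (int i = 0; i < temp.length - 1 && i < 3; i++) {
    tmp += temp[i] + "/";
}
// If this record is a robots.txt file, remember its offset for the site.
if (temp[temp.length - 1].equalsIgnoreCase("robots.txt")) {
    flag = true;
    robotTable.put(tmp, Long.valueOf(position));
}
System.out.println("position: " + position);
// Otherwise, look up the robots.txt offset already stored for this site.
if (!flag) {
    for (String key : robotTable.keySet()) {
        System.out.println("key: " + key + " vs " + tmp + "\n" + robotTable.get(key));
        if (tmp.equalsIgnoreCase(key)) {
            System.out.println("Robot: " + robotTable.get(key));
            robotPosition = robotTable.get(key);
        }
    }
}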
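
The fragment above depends on fields and a CrawlURI that only exist inside Heritrix. The same bookkeeping can be tried in isolation with a minimal standalone sketch. The class name RobotsOffsetIndex and the methods sitePrefix and robotPositionFor are hypothetical names for illustration and are not part of Heritrix. The sketch also differs from the fragment in two ways: it lower-cases the prefix once and uses a direct map lookup instead of scanning every key, and it keeps the host component even when the URL has no path.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical standalone helper; not part of Heritrix.
class RobotsOffsetIndex {
    // Maps a site prefix such as "http://example.com/" to the offset of its robots.txt record.
    private final Map<String, Long> robotTable = new HashMap<>();

    // Builds "scheme://host/" from the first three "/"-separated pieces of the URL.
    static String sitePrefix(String url) {
        String[] parts = url.split("/");
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length && i < 3; i++) {
            prefix.append(parts[i]).append('/');
        }
        // Lower-case so that lookups ignore case, like equalsIgnoreCase in the fragment.
        return prefix.toString().toLowerCase(Locale.ROOT);
    }

    // If the URL is a robots.txt file, records its offset and returns 0.
    // Otherwise returns the offset recorded for the same site, or 0 if none is known.
    long robotPositionFor(String url, long position) {
        String[] parts = url.split("/");
        String prefix = sitePrefix(url);
        if (parts[parts.length - 1].equalsIgnoreCase("robots.txt")) {
            robotTable.put(prefix, position);
            return 0L;
        }
        return robotTable.getOrDefault(prefix, 0L);
    }

    public static void main(String[] args) {
        RobotsOffsetIndex index = new RobotsOffsetIndex();
        index.robotPositionFor("http://example.com/robots.txt", 1024L);
        // Prints 1024: the page belongs to the site whose robots.txt was stored at offset 1024.
        System.out.println(index.robotPositionFor("http://example.com/docs/index.html", 4096L));
        // Prints 0: no robots.txt has been recorded for this site yet.
        System.out.println(index.robotPositionFor("http://other.org/page.html", 8192L));
    }
}
```

As in the fragment, a page written before its site's robots.txt has been seen gets offset 0, so this scheme relies on robots.txt being fetched first for each site.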