CS297-298 Project News Feed

Sep 15, 2009

After having been away for a very long time, mainly due to maternity, I am back! I now have a 5-month-old baby girl and I think it is time I got back to completing my project work.

Where I left off: Given a URL, I used Nutch to crawl the site up to a certain depth and fetched no more than 1000 pages. From the list of pages, I fetched each page using cURL. Once a page was fetched, my code looked for links to images on the page, got each image individually, and inserted unicode data for the image in place of the link.

What needs to be done: Look for links to other pages in each page, convert those to unicode data, and then convert the rest of the data on the page, which is text, to unicode. Only links to pages that have already been crawled need to be considered. Once each page has been handled this way, it is stored in an array such that
a[Page 1 URL] -> [Unicode of Page 1]
                     :
                     :
                     :
a[Page n URL] -> [Unicode of Page n]
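
A minimal sketch of the array described above, written in Python rather than the PHP the project uses. It assumes that "unicode data" means base64 text, as in the earlier base64_encode entries; the function names and the simple href pattern are hypothetical and only for illustration.

```python
import base64
import re
from urllib.parse import urljoin

# Assumption: links are written as href="..." with double quotes.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawled_links(page_url, html, crawled_urls):
    """Return the links on a page that point to pages already crawled."""
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(page_url, href)  # resolve relative links
        if absolute in crawled_urls and absolute not in links:
            links.append(absolute)
    return links


def build_page_array(pages):
    """Map each crawled page URL to the base64 text of that page."""
    return {
        url: base64.b64encode(html.encode("utf-8")).decode("ascii")
        for url, html in pages.items()
    }
```

Links to pages outside the crawled set are simply ignored, which matches the rule that only already-crawled pages need to be considered.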

To get things rolling again, I have set a small goal. Where I had stopped, my program read only the first page from the list of pages crawled by Nutch and fetched it. As a small increment to my existing code, I will read the entire list of crawled pages and output the list.

Apr 01

Mar 25

Spring break.

Mar 18

Other than scanning the site iteratively until I have scanned up to 1000 pages, everything to do with crawling the site is done. While running the program, the execution time limit set in php.ini is sometimes sufficient and sometimes not.

From the list of scanned pages, I have to get each of the pages using cURL() and start working on base64 encoding the data.

Mar 11

Still have to finish up the previous code. Use cURL() to actually get each individual page, since I'm not able to read the fetched data.

Mar 04

Today I was supposed to have demo'ed a working deliverable: take a URL, scan to a specified depth, and then fetch all the pages that were crawled. But the code broke when I added a series of Nutch commands. I should fix this and get it working as soon as possible.

Feb 26

Continue working on what was discussed until now...

Feb 19

Crawl up to 1000 pages or 2GB (set topN to 1000). Crawl to depth 1 and see if you got 1000 pages; if not, go to depth 2; if not, go to depth 3, and so on.
So, limit the total crawled size to 2GB.
Talked about fetch. Find out how to use the 'fetch' command.
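
The depth-by-depth plan above could be sketched roughly as follows, in Python rather than PHP. The crawl itself is passed in as a hypothetical `crawl_at_depth` function (a stand-in for running Nutch at a given depth, returning a list of `(url, size_in_bytes)` pairs); the `max_depth` safety stop is also an assumption, not something discussed in the meeting.

```python
def trim_to_bytes(pages, max_bytes):
    """Keep pages in order until adding one more would exceed max_bytes."""
    kept, total = [], 0
    for url, size in pages:
        if total + size > max_bytes:
            break
        kept.append((url, size))
        total += size
    return kept


def crawl_until_limit(crawl_at_depth, max_pages=1000,
                      max_bytes=2 * 1024 ** 3, max_depth=10):
    """Deepen the crawl until the page cap or the size cap is hit.

    crawl_at_depth(depth) is a hypothetical callback that returns a
    list of (url, size_in_bytes) pairs for a crawl to that depth.
    """
    depth, pages = 0, []
    while depth < max_depth:
        depth += 1
        found = crawl_at_depth(depth)
        pages = trim_to_bytes(found[:max_pages], max_bytes)
        # Stop once a cap was hit: the page cap was reached, or some
        # pages had to be dropped to respect either limit.
        if len(pages) >= max_pages or len(pages) < len(found):
            break
    return depth, pages
```

Passing the crawl in as a function keeps the stopping rule separate from however Nutch ends up being invoked.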

Feb 12

Take email address of the user so that the user needn't wait while the data is being crawled. We can inform him/her via email. Provide a link in the email from where final data can be accessed. (design issue)
(send email saying that the site is being crawled... and that email should have a link to the site where the data is saved...)
(give immediate response back.)
Just having a progress bar won't be enough.
Find out how long it takes to crawl a site up to depth 5 (for now)
Consider the size of the crawled data, because we finally have to present the data to the user. Check if the size reduces when it is zipped. Update blog. Name of project: website2go.
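
One way to check whether zipping helps is to compare the raw size with the size of an in-memory zip archive. This is only a sketch in Python using the standard `zipfile` module; the function name and the idea of passing the crawled files as a name-to-bytes dictionary are assumptions for illustration.

```python
import io
import zipfile


def raw_and_zipped_size(files):
    """Return (raw_bytes, zipped_bytes) for a dict of name -> bytes."""
    raw = sum(len(data) for data in files.values())
    buffer = io.BytesIO()
    # Write the archive into memory so nothing touches the disk.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return raw, len(buffer.getvalue())
```

Text such as HTML usually compresses well, while already-compressed images (JPEG, PNG) tend to shrink very little, so the saving depends on the mix of crawled content.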

Feb 7

As the first step, do something along the lines of what I did in Deliverable 2 of last semester. This time I should be able to crawl any reasonably sized website. In the previous deliverable I crawled a small site hosted on my computer.

Jan 29

Could not meet with professor.

Jan 24

Met Dr. Pollett and went through the first draft of the proposal for CS298.

Start of Spring 2008 - CS298

Dec 11

Dec 4

Showed Del 4 but haven't uploaded the .zip file yet. My code works for images with complete links; I should extend it to get images with relative links too. For next week I should have my 297 report, which is my last deliverable.
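
Handling relative image links mostly comes down to resolving each link against the URL of the page it appears on. A minimal sketch in Python (not the PHP used in the deliverable), assuming image tags of the form `<img ... src="...">` with double quotes; the function name is hypothetical.

```python
import re
from urllib.parse import urljoin

# Assumption: image links appear as <img ... src="..."> with double quotes.
IMG_SRC_RE = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)


def image_urls(page_url, html):
    """Return absolute URLs for every image on the page.

    urljoin leaves complete links unchanged and resolves relative
    ones against the URL of the page itself.
    """
    return [urljoin(page_url, src) for src in IMG_SRC_RE.findall(html)]
```

For example, on the page `http://example.com/docs/page.html` the link `images/logo.png` resolves to `http://example.com/docs/images/logo.png`.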

Nov 27

Showed part of Del 4 where I used base64_encode to encode strings and images. Should extend this so that my code takes in a URL as a parameter, gets the image files that the page might have (using curl), then base64 encodes these image files and displays them back in the browser.
We also spoke about what is to be done in the next semester.
- I need to learn how to create a cache of a page, and what I need in order to create one.
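
The base64 step in this entry can be sketched as follows, in Python rather than PHP. Fetching the image (done with curl in the deliverable) is left out; the caller supplies the image bytes. The function names and the default MIME type are assumptions for illustration.

```python
import base64


def image_data_uri(image_bytes, mime_type="image/png"):
    """Build a data URI holding the base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:%s;base64,%s" % (mime_type, encoded)


def inline_image(html, src, image_bytes, mime_type="image/png"):
    """Replace one image link in the page with its data URI."""
    return html.replace(
        'src="%s"' % src,
        'src="%s"' % image_data_uri(image_bytes, mime_type),
    )
```

A browser that supports data URIs can then display the image straight from the page, without a separate request for the image file.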

Nov 20

Make changes to Deliverable 2 to display code more appropriately

Write php code to show usage of base64_encode() and prepare a presentation on data URI schemes

Nov 13

For next week: Read on what should be in the data URI scheme.

Update Blog:

Put up Del 2
A more low-level ppt on obfuscation. Show the 4 main classifications. For each of the classifications, pick one or two sub-classifications relevant to JavaScript and show how the transformation changes a JavaScript code snippet.
Pick a tool that we will use for obfuscation

Tool picked for now is the Stunnix JavaScript Obfuscator. But I would like to try an open source obfuscator that I found recently. Find two available tools and describe which obfuscation techniques each of them uses.

Most commercially available JavaScript obfuscators aim at transforming the lexical structure of the original code.

Nov 6

Met on Nov 7th. Demo'ed the final implementation of the PHP script that executes Nutch on the site that is specified. Nutch crawls the site and stores data in its databases. The PHP script then reads from the database and displays relevant information in the browser. More Nutch commands can be executed from the script to read specific data.

Oct 30

Discussed more on how to get Nutch executed from the PHP script. Met Dr. Stamp to get information on code obfuscation techniques. Dr. Stamp pointed me to one particular paper which had the needed info.

Oct 23

I have to finish my web application in which, when a user inputs a URL, I crawl the site using Nutch and display the crawl results.

I have to meet with Dr. Stamp and find out about some code obfuscation techniques that can be used for JavaScript.

Oct 16

Learn how to read the data that is gathered by Nutch. Find out the contents of each of the sub-directories created by Nutch. Write a PHP program that has a page with a text box for specifying a URL and a Fetch button. The result should display a list of links resulting from the crawl.

Oct 11

Got Nutch installed. Learned that having Nutch crawl two sites was not very simple. I was confused about how I would have two hosts on my single server. I thought I had to have multiple IP addresses, but that is not so. I should set them up as virtual hosts in the config file and specify multiple host names.

Based on what I experienced, we decided to first have Nutch crawl a single site as described in the tutorial on the website. Next, have Nutch crawl a single site on my localhost. Then set up multiple hosts (say two) and, as a third stage, crawl these two sites.

Oct 2

Upload deliverables. Put links to the deliverables in the index file. Download Nutch and have it crawl some site. For this, have two hosts on one server and see how Nutch goes about crawling them.

Sept. 25

Did not meet.

Sept. 18

Finished my first JavaScript program. Next: read about web crawlers. Give a presentation on Nutch (an open source crawler). Coding assignment: Write PHP code to edit graphics/images.

Sept. 11

Discussed the JavaScript code that I had to work on.

Sept. 4

Edit my blog entries. Write a small bio about myself. Write a JavaScript program that has a small black box on a page. The box should move top, down, left and right according to the user input. There should be four buttons "top", "down", "left" and "right" that help achieve this. There should also be a submit button. When this button is pressed, a new page opens which displays the x and y coordinates of the latest position of the black box. The black box should be defined using only <div> tags and not <img>.