CS298 Proposal

Bookmarklet Builder for Offline Data Retrieval

Sheetal Naidu (sheetalnaidu@yahoo.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Mark Stamp (stamp@cs.sjsu.edu) and Dr. Tom Qi Zhang (qzhang@cs.sjsu.edu)

Abstract:

The goal of this project is to develop a tool that can save entire web page applications as bookmarklets. This will enable users to use these applications even when they are not connected to the Internet. The main technology needed beyond JavaScript is the data: URI scheme, which allows images, Flash, applets, PDFs, etc. to be embedded directly in a web page as base64-encoded text. All major browsers other than Internet Explorer support this URI scheme. Our Bookmarklet Builder program will be implemented as a website to which users can submit a URL to be crawled to a certain depth. Our program then crawls the site to fetch the appropriate pages and converts the necessary images and video files to the data: URI format. It then converts the modified pages themselves to the data: URI format. Finally, a JavaScript wrapper string is constructed to manage which page is displayed at any given time. The resulting code is obfuscated so that it can potentially be sold as a commercial product. To fetch the pages the user requests, our site will use a PHP back end and the open-source Nutch web spider.
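The two conversion steps described above can be sketched as follows. This is a minimal illustration for Node.js, not the project's implementation: the helper names toDataUri and inlineResources, the sample file name logo.png, and the four sample bytes are all assumptions made for the example, and the src matching is a simple regular expression rather than a full HTML parser.

```javascript
// Hypothetical helper (not from the proposal): wrap raw bytes in a
// base64-encoded data: URI with the given MIME type.
function toDataUri(bytes, mimeType) {
  return 'data:' + mimeType + ';base64,' + Buffer.from(bytes).toString('base64');
}

// Hypothetical helper: replace every src="..." value that appears as a key
// in `resources` (a Map from src string to { bytes, mimeType }) with the
// corresponding data: URI. Unknown src values are left unchanged.
function inlineResources(html, resources) {
  return html.replace(/src="([^"]+)"/g, (match, src) => {
    const res = resources.get(src);
    return res ? 'src="' + toDataUri(res.bytes, res.mimeType) + '"' : match;
  });
}

// Step 1: embed the image in the page (sample bytes are the first four
// bytes of the PNG signature, used here only as placeholder data).
const resources = new Map([
  ['logo.png', { bytes: Buffer.from([0x89, 0x50, 0x4e, 0x47]), mimeType: 'image/png' }],
]);
const page = inlineResources('<img src="logo.png">', resources);

// Step 2: convert the modified page itself to a data: URI.
const pageUri = toDataUri(Buffer.from(page, 'utf8'), 'text/html');

console.log(page);
console.log(pageUri);
```

After step 2 the page and everything it embeds travel as a single string, which is what allows the wrapper to hold many pages in one bookmarklet.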

CS297 Results

  • Learned JavaScript, which is one of the main scripting languages used to implement this project.
  • Researched web crawlers and picked Nutch, which will be used to crawl the specified site. From the fetched data, we can extract information such as the links between pages.
  • Researched available JavaScript obfuscators and picked Stunnix JavaScript Obfuscator for our project.
  • Learned about the data: URI scheme and implemented conversion of different forms of data to the data: URI scheme.
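To illustrate what the link information between pages could look like once pages have been fetched, here is a small sketch that is independent of Nutch. The function names extractLinks, buildLinkGraph, and pagesWithinDepth and the example.com URLs are assumptions for illustration only, and the href matching uses a simple regular expression that handles only double-quoted attributes.

```javascript
// Hypothetical helper: collect the absolute URLs of all <a href="..."> links
// in a page. Relative links are resolved against the page's own URL.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*href="([^"]*)"/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      links.push(new URL(match[1], baseUrl).href);
    } catch (e) {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Hypothetical helper: build a link graph (Map from page URL to the list of
// URLs it links to) from a Map of page URL to HTML.
function buildLinkGraph(pages) {
  const graph = new Map();
  for (const [url, html] of pages) {
    graph.set(url, extractLinks(html, url));
  }
  return graph;
}

// Hypothetical helper: breadth-first walk of the link graph that returns the
// pages reachable from startUrl in at most maxDepth link hops.
function pagesWithinDepth(graph, startUrl, maxDepth) {
  const depthOf = new Map([[startUrl, 0]]);
  const queue = [startUrl];
  for (let i = 0; i < queue.length; i++) {
    const url = queue[i];
    const depth = depthOf.get(url);
    if (depth === maxDepth) continue; // do not follow links past the limit
    for (const next of graph.get(url) || []) {
      if (!depthOf.has(next)) {
        depthOf.set(next, depth + 1);
        queue.push(next);
      }
    }
  }
  return queue;
}

const pages = new Map([
  ['http://example.com/a.html', '<a href="/b.html">B</a>'],
  ['http://example.com/b.html', '<a href="c.html">C</a>'],
  ['http://example.com/c.html', '<p>No links here.</p>'],
]);
const graph = buildLinkGraph(pages);
console.log(pagesWithinDepth(graph, 'http://example.com/a.html', 1));
```

The breadth-first order means a page is always recorded at its smallest distance from the start page, which matches the idea of crawling to a specified depth.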

Proposed Schedule

Weeks 1 and 2 (Jan 23 - Feb 5): Design the look and necessary features of the website.
Weeks 3, 4 and 5 (Feb 6 - 26): Implement the website designed above.
Weeks 6, 7 and 8 (Feb 27 - Mar 18): Implement fetching of entire websites and extraction of the necessary information from the crawled data.
Weeks 9, 10 and 11 (Mar 19 - Apr 8): Work on the algorithm to convert different kinds of fetched data to the data: URI scheme.
Weeks 12 and 13 (Apr 9 - 22): Obfuscate the generated JavaScript program using Stunnix.
Week 14 (Apr 23 - 29): Code testing and bug fixing.
Weeks 15 and 16 (Apr 30 - May 13): Write the report.

Key Deliverables:

  • Software
    • Build a website which will be the front-end of our project. From this website users can enter the URL of the website that will be converted to a bookmarklet.
    • Implement a program which will crawl the specified URL and fetch all the pages up to the specified depth. Extract necessary information from the crawled data so that we know how pages are inter-linked with one another.
    • Convert the fetched pages to the data: URI scheme and make them available such that each page will be displayed in a <div> tag in the appropriate place.
    • Obfuscate the resulting code.
  • Report
    • Write the final report.
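The third software deliverable, in which a wrapper shows each stored page in a <div>, could take roughly the following shape. This is only a sketch under stated assumptions: the function name buildBookmarklet is hypothetical, the pages are passed in as plain HTML strings keyed by URL rather than as data: URIs, and links inside the stored pages are assumed to have already been rewritten to absolute URLs that match the keys of the page table.

```javascript
// Hypothetical helper: build a bookmarklet (a javascript: URL) from crawled
// pages. `pages` is a plain object mapping each page URL to its HTML, and
// `startUrl` is the page to show first.
function buildBookmarklet(pages, startUrl) {
  // JSON.stringify produces a valid JavaScript literal for the page table.
  const table = JSON.stringify(pages);
  const body =
    '(function(){' +
    'var pages=' + table + ';' +
    // A single <div> acts as the viewport for whichever page is current.
    'var view=document.createElement("div");' +
    'document.body.innerHTML="";' +
    'document.body.appendChild(view);' +
    'function show(url){if(pages[url]){view.innerHTML=pages[url];}}' +
    // Clicks on links to stored pages are served from the table instead of
    // the network; other links keep their default behavior.
    'view.addEventListener("click",function(e){' +
    'var a=e.target.closest?e.target.closest("a"):null;' +
    'if(a&&pages[a.href]){e.preventDefault();show(a.href);}' +
    '});' +
    'show(' + JSON.stringify(startUrl) + ');' +
    '})();';
  return 'javascript:' + encodeURIComponent(body);
}

const bookmarklet = buildBookmarklet(
  {
    'http://example.com/a.html': '<a href="http://example.com/b.html">Next</a>',
    'http://example.com/b.html': '<p>Second page</p>',
  },
  'http://example.com/a.html'
);
console.log(bookmarklet.length + ' characters');
```

Keeping every page in one table and swapping the contents of a single <div> means navigation never leaves the bookmarklet, so no network access is needed after it has been saved. The obfuscation step would then be applied to the generated wrapper code.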

Innovations and Challenges

  • Our project will enable users to browse entire websites even when offline.
  • The amount of space it uses will be less than would be needed to save each page of the website individually.
  • There will also be no need to store a cache of the website's pages.
  • It will provide a simple way to archive entire websites.

References:

[2004] JavaScript Bible. Danny Goodman with Michael Morrison. Wiley. 2004.

[2006] JavaScript: The Definitive Guide. David Flanagan. O'Reilly. 2006.

[2006] Programming PHP. Rasmus Lerdorf, Kevin Tatroe, and Peter MacIntyre. O'Reilly. 2006.

[2005] RFC 3986. Uniform Resource Identifier (URI): Generic Syntax. Network Working Group. "http://gbiv.com/protocols/uri/rfc/rfc3986.html"

[2007] Official page of Nutch project. "http://lucene.apache.org/nutch/"

[2002] Stunnix JavaScript Obfuscator "www.stunnix.com"