CS298 Proposal
Bookmarklet Builder for Offline Data Retrieval
Sheetal Naidu (sheetalnaidu@yahoo.com)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Mark Stamp (stamp@cs.sjsu.edu) and Dr. Tom Qi Zhang (qzhang@cs.sjsu.edu)

Abstract:
The goal of this project is to develop a tool that can save entire web page applications as bookmarklets, so that users can run these applications even when they are not connected to the Internet. The main technology needed for this, beyond JavaScript, is the data: URI scheme. This scheme allows images, Flash, applets, PDFs, etc. to be embedded directly in a web page as base64-encoded text. All major browsers other than Internet Explorer support this URI scheme.

Our Bookmarklet Builder program will be implemented as a website to which users can submit a URL, along with the depth to which it should be crawled. The program then crawls the site to fetch the appropriate pages and converts the necessary images and video files to the data: URI format. It then converts the modified pages themselves to the data: URI format. Finally, a JavaScript wrapper string is constructed to manage which page is displayed at any given time. The resulting code is obfuscated so that it can potentially be sold as a commercial product. To fetch the pages the user requests, our site will use a PHP back-end and the open-source Nutch web spider.

CS297 Results
Proposed Schedule
Key Deliverables:
Innovations and Challenges
References:
[2004] JavaScript Bible. Danny Goodman with Michael Morrison. Wiley. 2004.
[2006] JavaScript: The Definitive Guide. David Flanagan. O'Reilly. 2006.
[2006] Programming PHP. Rasmus Lerdorf, Kevin Tatroe, and Peter MacIntyre. O'Reilly. 2006.
[2005] RFC 3986. Uniform Resource Identifier (URI): Generic Syntax. Network Working Group. "http://gbiv.com/protocols/uri/rfc/rfc3986.html"
[2007] Official page of Nutch project. "http://lucene.apache.org/nutch/"
[2002] Stunnix JavaScript Obfuscator. "www.stunnix.com"
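
The embedding step described in the abstract (converting resources, and then whole pages, to the data: URI format) can be illustrated with a minimal sketch. It assumes Node.js and uses hypothetical helper names (`toDataUri`, `inlineResources`, `pageToDataUri`) that do not come from the project itself. The plain string replacement stands in for whatever page rewriting the real tool performs.

```javascript
// Illustrative sketch only: helper names and the replacement strategy are
// assumptions, not the project's actual implementation.

// Encode raw bytes as a base64 data: URI with the given MIME type.
function toDataUri(mimeType, bytes) {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Replace every occurrence of each known resource URL in an HTML string
// with the matching data: URI.
// `resources` maps a URL (exactly as written in the HTML) to { mimeType, bytes }.
function inlineResources(html, resources) {
  let result = html;
  for (const [url, { mimeType, bytes }] of Object.entries(resources)) {
    result = result.split(url).join(toDataUri(mimeType, bytes));
  }
  return result;
}

// Convert a whole (already inlined) HTML page to a data: URI.
function pageToDataUri(html) {
  return toDataUri("text/html", Buffer.from(html, "utf8"));
}

// Example: a tiny page with one image reference. The six bytes "GIF89a" are
// only a GIF file header used as placeholder data, not a complete image.
const html = '<p>Hi</p><img src="logo.gif">';
const inlined = inlineResources(html, {
  "logo.gif": { mimeType: "image/gif", bytes: Buffer.from("GIF89a", "latin1") },
});

console.log(inlined);
console.log(pageToDataUri(inlined));
```

In this sketch the image reference becomes `src="data:image/gif;base64,R0lGODlh"`, and the final call produces a single `data:text/html;base64,...` string for the page. A wrapper script, as the abstract describes, would then select among several such strings.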