CS298 Proposal

Extending Yioop! abilities to search the Invisible Web

Tanmayee Potluri (tanmayee.4170@gmail.com)

Advisor: Dr. Chris Pollett

Committee Members:Dr. Sami Khuri and Prof. Frank Butt

Abstract:

This project aims to add to the Yioop search engine the ability to crawl and index the Invisible Web. The Invisible Web refers to information on the Web that conventional crawlers do not reach, such as database content, non-text files, and password-restricted sites. One source of dark content is URL shortening services: when ranking results, search engines often attribute a link to the shortening service rather than to the page the link points to. I will study how Yioop handles these links and add the ability to associate each short link with its original URL. Sphinx is an open source search engine that can index database content. I will study how Sphinx does this and extend Yioop so that it can index database content in the same way. I will also add to Yioop the ability to crawl password-restricted sites when the login credentials are provided, an extension to extract URLs from JavaScript on web pages, and the ability to override robots.txt files. Solving these problems will extend Yioop's reach into the Invisible Web.

CS297 Results

  • Performed a simple crawl in Yioop to understand the inner workings of Yioop.
  • Performed a crawl on a bit.ly link to understand the behavior of Yioop with shortened links.
  • Created a patch for Yioop to make it work correctly with shortened links.
  • Tested the Sphinx search server with my own database to understand how Sphinx works.
  • Explored Google's and Yioop's PageRank algorithms to suggest a method for ranking short links in Yioop.
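
To illustrate the short-link problem described above, the following is a minimal sketch of following a redirect chain so that a short link can be associated with the page it finally points to. It is not the Yioop patch itself. The callable `get_location`, the constant `MAX_HOPS`, and the `FAKE_REDIRECTS` table are all hypothetical names introduced for this example; in a real crawler `get_location` would issue an HTTP request and return the value of the Location header.

```python
from urllib.parse import urljoin

MAX_HOPS = 5  # assumed cap on redirects, to guard against long chains


def resolve_short_link(url, get_location, max_hops=MAX_HOPS):
    """Follow redirects from `url` and return (final_url, chain).

    `get_location(url)` is a hypothetical callable that returns the
    Location header for `url`, or None if the response is not a redirect.
    """
    chain = [url]
    current = url
    for _ in range(max_hops):
        location = get_location(current)
        if location is None:
            break  # not a redirect: `current` is the final page
        target = urljoin(current, location)  # Location may be relative
        if target in chain:
            break  # redirect loop: stop at the last new URL
        chain.append(target)
        current = target
    return current, chain


# Simulated redirects standing in for real HTTP responses (made-up URLs).
FAKE_REDIRECTS = {
    "http://bit.ly/abc": "http://example.com/landing",
    "http://example.com/landing": "/article?id=7",
}

final_url, chain = resolve_short_link("http://bit.ly/abc", FAKE_REDIRECTS.get)
print(final_url)
print(chain)
```

With the final URL known, a crawler could credit inbound links to the target page rather than to the shortening service. How that credit is assigned in the ranking is a separate design decision that this sketch does not address.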

Proposed Schedule

Week 1 (01/26/2012 - 02/01/2012): Discuss the project in detail with the advisor
Week 2-3 (02/02/2012 - 02/15/2012): Study how to extract URLs from JavaScript
Week 4 (02/16/2012 - 02/22/2012): Deliverable 1 due: Software to extract URLs from JavaScript
Week 5 (02/23/2012 - 03/01/2012): Study how to crawl password-restricted sites
Week 6-7 (03/02/2012 - 03/15/2012): Index content of websites if the login and password values are provided
Week 8 (03/16/2012 - 03/22/2012): Deliverable 2 due: Extension for Yioop to crawl password-restricted sites
Week 9-10 (03/23/2012 - 04/05/2012): Study how to index database content
Week 11 (04/06/2012 - 04/12/2012): Deliverable 3 due: Extension for Yioop to index database content
Week 12 (04/13/2012 - 04/19/2012): Test all the extensions created and write a document on the tests and results
Week 13 (04/20/2012 - 04/26/2012): Work on CS298 Report
Week 14 (04/27/2012 - 05/03/2012): CS298 Report first draft: submit to Advisor and Committee
Week 15 (05/04/2012 - 05/10/2012): CS298 Report final document: submit to Advisor and Committee
Week 16 (05/11/2012 - 05/17/2012): Defense

Key Deliverables:

  • Software
    • Deliverable 1: Software to extract URLs from JavaScript.
    • Deliverable 2: Extension for Yioop to crawl password-restricted sites.
    • Deliverable 3: Extension for Yioop to index database content.
  • Report
    • CS298 Report
    • Project code and test results documentation
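
As a rough illustration of what indexing database content (Deliverable 3) involves, the sketch below reads rows from a table and builds an inverted index that maps each word to the rows containing it. It is only an assumption about one possible approach and does not reflect how Sphinx or Yioop is implemented. The `documents` table, its columns, and the function `build_inverted_index` are hypothetical, and an in-memory SQLite database stands in for a real database server.

```python
import re
import sqlite3
from collections import defaultdict


def build_inverted_index(connection, table, key_column, text_columns):
    """Map each lowercase word to the set of row keys whose text contains it."""
    columns = ", ".join([key_column] + list(text_columns))
    index = defaultdict(set)
    # Table and column names are assumed to come from trusted crawl
    # configuration, not from user input, so they are interpolated directly.
    for row in connection.execute(f"SELECT {columns} FROM {table}"):
        key, texts = row[0], row[1:]
        for text in texts:
            if text is None:
                continue  # skip NULL columns
            for word in re.findall(r"\w+", str(text).lower()):
                index[word].add(key)
    return index


# Hypothetical sample data in an in-memory database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
)
connection.executemany(
    "INSERT INTO documents (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "Invisible Web", "Content hidden in databases"),
        (2, "Search engines", "Crawling and indexing the Web"),
    ],
)

index = build_inverted_index(connection, "documents", "id", ["title", "body"])
print(sorted(index["web"]))
```

A search for a word then becomes a lookup in `index`, and the row key identifies which database record to present as a result.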

Innovations and Challenges

  • Extracting URLs from JavaScript is challenging because it requires parsing the JavaScript code in which the links are embedded.
  • Crawling password-restricted sites and database content is innovative because it would expose a large body of information on the Web that is currently hidden from search engines.
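
To make the first challenge concrete, the following is a minimal sketch of a heuristic that pulls URL-like string literals out of JavaScript source. It is a regular-expression approximation, not a full JavaScript parser, so it will miss URLs that are assembled at run time and can be confused by quotes inside comments. The function name `extract_urls_from_js`, the patterns, and the sample script are all assumptions made for this example.

```python
import re
from urllib.parse import urljoin

# Single- or double-quoted JavaScript string literals (template literals
# and string concatenation are not handled by this sketch).
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""")
# Heuristics for "looks like a URL": absolute or root-relative links,
# or relative paths ending in a common page extension.
ABSOLUTE_OR_ROOTED = re.compile(r"^(?:https?://|/)[^\s<>]+$", re.IGNORECASE)
RELATIVE_PAGE = re.compile(
    r"^[\w./-]+\.(?:html?|php|aspx?|jsp)(?:\?\S*)?$", re.IGNORECASE
)


def extract_urls_from_js(js_source, base_url):
    """Return URL-like string literals in `js_source`, resolved against `base_url`."""
    urls = []
    for match in STRING_LITERAL.finditer(js_source):
        literal = match.group(2).replace("\\/", "/")  # undo escaped slashes
        if ABSOLUTE_OR_ROOTED.match(literal) or RELATIVE_PAGE.match(literal):
            absolute = urljoin(base_url, literal)
            if absolute not in urls:  # keep first occurrence only
                urls.append(absolute)
    return urls


sample_js = """
window.location = "http://example.com/a.html";
function go() { location.href = '/docs/page2.php?x=1'; }
var next = "next.html";
var label = "Click here";
"""

for url in extract_urls_from_js(sample_js, "http://example.com/index.html"):
    print(url)
```

A complete solution would need to evaluate or at least parse the script, since links are often built by concatenating strings or computed from variables.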

References

[1] Accessing the Deep Web. Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang. ACM. 2007.

[2] Source: http://en.wikipedia.org/wiki/Invisible_Web#Accessing, Retrieved August 22, 2011.

[3] Source: http://sphinxsearch.com/docs/2.0.1/, Retrieved December 2, 2011.

[4] Google’s PageRank and Beyond. Amy N. Langville and Carl D. Meyer.

[5] Source: http://www.webworkshop.net/pagerank.html#how_is_pagerank_calculated, Retrieved October 24, 2011.

[6] Source: http://en.wikipedia.org/wiki/PageRank, Retrieved October 24, 2011.

[7] Source: http://en.wikipedia.org/wiki/URL_shortening, Retrieved October 15, 2011.