CS298 Proposal
Extending Yioop! abilities to search the Invisible Web
Tanmayee Potluri (tanmayee.4170@gmail.com)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Sami Khuri and Prof. Frank Butt
Abstract:
This project aims to extend the Yioop search engine so that it can crawl and index the Invisible Web. The Invisible Web refers to Web content that conventional search engines do not index, such as database content, non-text files, and password-restricted sites. One source of dark content on the Web is URL shortening services: when ranking results, search engines often attribute a link to the shortening service rather than to the page the link points to. I will study how Yioop handles such links and add to Yioop the ability to associate them with the original URLs. Sphinx is an open source search engine that can index database content. I will study how Sphinx does this and extend Yioop so that it can index database content in the same way. I will also add to Yioop the ability to crawl password-restricted sites when the login credentials are provided. In addition, I will add an extension to Yioop that extracts URLs from JavaScript on websites, as well as the ability to override robots.txt files. Solving these problems will extend Yioop's ability to search the Invisible Web.
CS297 Results
- Performed a simple crawl in Yioop to understand the inner workings of Yioop.
- Performed a crawl on a bit.ly link to understand the behavior of Yioop with shortened links.
- Created a patch for Yioop to make it work correctly with shortened links.
- Tested the Sphinx search server with my own database to understand how Sphinx works.
- Explored Google's and Yioop's PageRank algorithms to suggest a method for ranking short links in Yioop.
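The idea of associating a short link with its original link can be illustrated with a minimal Python sketch. This is not the Yioop patch itself; it only assumes that a shortening service answers with an HTTP redirect whose Location header points at the target. The function name `resolve_short_link` and the `fetch_head` callable are hypothetical and introduced here for illustration, so that the logic can be tried without network access.

```python
from urllib.parse import urljoin

# HTTP status codes that indicate a redirect.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_short_link(url, fetch_head, max_hops=5):
    """Follow redirects starting at url and return (final_url, chain).

    fetch_head is a hypothetical callable supplied by the caller:
    it takes a URL and returns (status_code, location_header_or_None).
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch_head(url)
        if status not in REDIRECT_CODES or not location:
            break  # reached a page that does not redirect further
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in chain:
            break  # redirect loop, stop here
        chain.append(next_url)
        url = next_url
    return url, chain
```

With the chain in hand, a crawler could credit the content and any ranking information to the final URL while still recording the short link as an alias.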
Proposed Schedule
Week 1 | 01/26/2012 - 02/01/2012 | Discuss the project in detail with the advisor
Week 2-3 | 02/02/2012 - 02/15/2012 | Study how to extract URLs from JavaScript
Week 4 | 02/16/2012 - 02/22/2012 | Deliverable 1 due: Software to extract URLs from JavaScript
Week 5 | 02/23/2012 - 03/01/2012 | Study how to crawl password-restricted sites
Week 6-7 | 03/02/2012 - 03/15/2012 | Index content of websites if the login and password values are provided
Week 8 | 03/16/2012 - 03/22/2012 | Deliverable 2 due: Extension for Yioop to crawl password-restricted sites
Week 9-10 | 03/23/2012 - 04/05/2012 | Study how to index database content
Week 11 | 04/06/2012 - 04/12/2012 | Deliverable 3 due: Extension for Yioop to index database content
Week 12 | 04/13/2012 - 04/19/2012 | Test all the extensions created and write a document on the tests and results
Week 13 | 04/20/2012 - 04/26/2012 | Work on CS298 Report
Week 14 | 04/27/2012 - 05/03/2012 | CS298 Report first draft: submit to Advisor and Committee
Week 15 | 05/04/2012 - 05/10/2012 | CS298 Report final document: submit to Advisor and Committee
Week 16 | 05/11/2012 - 05/17/2012 | Defense
Key Deliverables:
- Software
  - Deliverable 1: Software to extract URLs from JavaScript.
  - Deliverable 2: Extension for Yioop to crawl password-restricted sites.
  - Deliverable 3: Extension for Yioop to index database content.
- Report
  - CS298 Report
  - Project code and test results documentation
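As a rough illustration of what Deliverable 2 involves, the following Python sketch shows one way a crawler might log in to a form-based site when the credentials are supplied. It is an assumption-laden sketch, not Yioop code: the helper names `build_login_request` and `make_session` and the default form field names `username` and `password` are hypothetical, and real sites may use different field names, hidden tokens, or HTTP authentication instead of a form.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password,
                        user_field="username", pass_field="password"):
    """Build a POST request carrying the login form fields.

    The field names are assumptions; they must match the target site's form.
    """
    form = {user_field: username, pass_field: password}
    data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(login_url, data=data, method="POST")


def make_session():
    """Return (opener, cookie_jar); the opener keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

A crawler would first call `opener.open(build_login_request(...))` and then fetch the restricted pages with the same opener, so that the session cookie set at login accompanies each later request.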
Innovations and Challenges
- Extracting URLs from JavaScript is challenging because it involves fully parsing the JavaScript that contains the links.
- Crawling password-restricted sites and database content is innovative because it would uncover a large amount of information on the Web that is otherwise hidden from search engines.
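To make the first challenge concrete, here is a deliberately simple Python sketch that pulls URL-like string literals out of JavaScript source with regular expressions. It is a heuristic under stated assumptions, not the complete parsing the project calls for: it ignores escaped quotes, template strings, and URLs assembled by concatenation, and it misses URLs embedded inside longer strings such as HTML fragments. The function name `extract_urls_from_js` and the file extensions treated as links are hypothetical choices made for illustration.

```python
import re
from urllib.parse import urljoin

# Single- or double-quoted string literals without escapes or newlines.
STRING_LITERAL = re.compile(r"'([^'\\\n]*)'" + "|" + r'"([^"\\\n]*)"')

# Heuristic: absolute http(s) URLs, root-relative paths, or page-like names.
URL_LIKE = re.compile(
    r"^(?:https?://\S+|/\S+|\S+\.(?:html?|php)(?:[?#]\S*)?)$",
    re.IGNORECASE,
)


def extract_urls_from_js(js_source, base_url):
    """Return absolute URLs found in string literals, in order of appearance."""
    found = []
    for match in STRING_LITERAL.finditer(js_source):
        # Exactly one of the two groups is set, depending on the quote style.
        literal = match.group(1) if match.group(1) is not None else match.group(2)
        if literal and URL_LIKE.match(literal):
            absolute = urljoin(base_url, literal)  # resolve relative links
            if absolute not in found:
                found.append(absolute)
    return found
```

The gap between this heuristic and real pages, where links are often built at run time, is the reason a fuller treatment of the JavaScript is needed.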
References
[1] Accessing the Deep Web. Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang. ACM. 2007.
[2] Source: http://en.wikipedia.org/wiki/Invisible_Web#Accessing, Retrieved August 22, 2011.
[3] Source: http://sphinxsearch.com/docs/2.0.1/, Retrieved December 2, 2011.
[4] Google's PageRank and Beyond. Amy N. Langville and Carl D. Meyer.
[5] Source: http://www.webworkshop.net/pagerank.html#how_is_pagerank_calculated, Retrieved October 24, 2011.
[6] Source: http://en.wikipedia.org/wiki/PageRank, Retrieved October 24, 2011.
[7] Source: http://en.wikipedia.org/wiki/URL_shortening, Retrieved October 15, 2011.