CS297 Proposal
Extending Yioop! abilities to search the Invisible Web
Tanmayee Potluri (tanmayee.4170@gmail.com)
Advisor: Dr. Chris Pollett
Description:
This project aims to add to the Yioop search engine the ability to crawl and index the Invisible Web. The Invisible Web refers to Web content that conventional search engines do not index well, such as database content, non-text files, and password-restricted sites. One source of dark content on the Web is URL shortening services. Short links become dark content because, when ranking results, search engines often attribute a link to the URL shortening service rather than to the page the link points to. I will study how Yioop handles these links and add to Yioop the ability to rank the target pages correctly. Sphinx is an open source search engine that can index database content. I will study how Sphinx does this and extend Yioop so that it can index database content in the same way. I will also work on enabling Yioop to crawl password-restricted sites when the password is provided. Solving these problems will extend Yioop's ability to search the Invisible Web.
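To make the short-link problem concrete, the following is a minimal sketch in Python of crediting a link to its final destination rather than to the shortening service. It is not taken from Yioop; the function names, the `short.example` domain, and the in-memory `redirects` map are all hypothetical. In a real crawler the map would presumably be built from the HTTP redirect responses returned by the shortening service.

```python
from collections import defaultdict


def resolve_short_link(url, redirects, max_hops=10):
    """Follow a chain of redirects and return the last URL reached.

    `redirects` is a hypothetical in-memory map from a URL to the URL it
    redirects to. The walk stops at a URL with no redirect, when a loop
    is detected, or after `max_hops` hops.
    """
    seen = set()
    current = url
    for _ in range(max_hops):
        if current in seen or current not in redirects:
            break
        seen.add(current)
        current = redirects[current]
    return current


def count_inlinks(links, redirects):
    """Count inbound links per page, crediting resolved targets.

    `links` is an iterable of (source_url, target_url) pairs. Each
    target is resolved first, so a link through a shortener counts
    toward the page it ultimately points to.
    """
    counts = defaultdict(int)
    for _source, target in links:
        counts[resolve_short_link(target, redirects)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Assumed example data: two short links chained to one real page.
    redirects = {
        "http://short.example/a": "http://short.example/b",
        "http://short.example/b": "http://example.com/page",
    }
    links = [
        ("http://blog.example/post", "http://short.example/a"),
        ("http://news.example/item", "http://example.com/page"),
    ]
    print(count_inlinks(links, redirects))
```

With this kind of resolution step, both links in the example count toward the same target page, which is the behavior a link-based ranking would need in order to treat short links correctly.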
Schedule:
Week 1-2:
08/29/2011 - 09/10/2011 | Discuss the project in detail with the advisor |
Week 3:
09/11/2011 - 09/18/2011 | Download and study Yioop code and perform experiments with it |
Week 4:
09/19/2011 - 09/26/2011 | Deliverable 1 due: Study how URL shortening service links are handled by Yioop and associate the short links correctly with the original page. |
Week 5:
09/27/2011 - 10/3/2011 | Download and study Sphinx code |
Week 6:
10/4/2011 - 10/11/2011 | Deliverable 2 due: Perform experiments with Sphinx to index a newly created database |
Week 7-8:
10/12/2011 - 10/25/2011 | Explore various page ranking algorithms |
Week 9:
10/26/2011 - 11/1/2011 | Deliverable 3 due: Study various page ranking algorithms that can be used with URL shorteners. |
Week 10-11:
11/2/2011 - 11/16/2011 | Study how to crawl password restricted sites |
Week 12:
11/17/2011 - 11/24/2011 | Deliverable 4 due: Sample scripts for Yioop to crawl password restricted sites |
Week 13-14:
11/25/2011 - 12/8/2011 | Work on CS297 Report |
Week 15:
12/9/2011 - 12/14/2011 | Deliverable 5 due: CS297 Report |
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
1. Study how URL shortening service links are handled by Yioop and associate the short links correctly with the original page.
2. Perform experiments with Sphinx to index a newly created database
3. Study various page ranking algorithms that can be used with URL shorteners.
4. Modify Yioop to crawl password restricted sites.
5. CS297 Report
References:
[1] Accessing the Deep Web. Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang. ACM. 2007.
[2] http://en.wikipedia.org/wiki/Invisible_Web#Accessing, August 22, 2011.
[3] http://www.fridaytrafficreport.com/exploring-the-deep-web/, August 21, 2011.
[4] http://en.wikipedia.org/wiki/Sphinx_(search_engine), August 28, 2011.