CS297 Proposal

Extending Yioop! abilities to search the Invisible Web

Tanmayee Potluri (tanmayee.4170@gmail.com)

Advisor: Dr. Chris Pollett

Description:

This project aims to give the Yioop search engine the ability to crawl and index the Invisible Web. The Invisible Web refers to information on the Web that search engines typically do not index, such as database content, non-text files, and password-restricted sites. One source of dark content is URL shortening services: when ranking results, search engines often attribute a link to the URL shortening service rather than to the page the link points to. I will study how Yioop handles these links and add to Yioop the ability to rank the target pages correctly. Sphinx is an open source search engine that can index database content. I will study how Sphinx does this and extend Yioop so that it can index database content as well. I will also add to Yioop the ability to crawl password-restricted sites when the password is provided. Solving these problems will extend Yioop's reach into the Invisible Web.
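
As an illustration of the short-link problem, the sketch below follows a redirect chain so that a link can be credited to its final destination rather than to the shortening service. It is a minimal sketch, not Yioop's implementation: the function name `resolve_short_url`, the `next_hop` callback, and the `short.example` URLs are all hypothetical, and a dictionary stands in for the HTTP redirect responses a crawler would actually receive.

```python
from typing import Callable, Optional


def resolve_short_url(url: str,
                      next_hop: Callable[[str], Optional[str]],
                      max_hops: int = 10) -> str:
    """Follow a redirect chain and return the last URL reached.

    next_hop(url) returns the redirect target of url, or None when url
    does not redirect. Both names are assumptions made for this sketch.
    """
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(url)
        if target is None:
            return url          # no further redirect: this is the destination
        if target in seen:
            return url          # redirect loop: stop where we are
        seen.add(target)
        url = target
    return url                  # hop limit reached


# Simulated redirect table standing in for HTTP "Location" headers.
redirects = {
    "http://short.example/abc": "http://short.example/xyz",
    "http://short.example/xyz": "http://www.example.com/article",
}

final_url = resolve_short_url("http://short.example/abc", redirects.get)
print(final_url)  # → http://www.example.com/article
```

With the destination known, a crawler could record the inbound link against `final_url` instead of against the short URL. The hop limit and loop check are there so that a misconfigured or malicious redirect chain cannot stall the crawl.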

Schedule:

Week 1-2 (08/29/2011 - 09/10/2011): Discuss the project in detail with the advisor.
Week 3 (09/11/2011 - 09/18/2011): Download and study the Yioop code and perform experiments with it.
Week 4 (09/19/2011 - 09/26/2011): Deliverable 1 due: Study how URL shortening service links are handled by Yioop and associate the short links correctly with the original page.
Week 5 (09/27/2011 - 10/03/2011): Download and study the Sphinx code.
Week 6 (10/04/2011 - 10/11/2011): Deliverable 2 due: Perform experiments with Sphinx to index a newly created database.
Week 7-8 (10/12/2011 - 10/25/2011): Explore various page ranking algorithms.
Week 9 (10/26/2011 - 11/01/2011): Deliverable 3 due: Study various page ranking algorithms that can be used with URL shorteners.
Week 10-11 (11/02/2011 - 11/16/2011): Study how to crawl password-restricted sites.
Week 12 (11/17/2011 - 11/24/2011): Deliverable 4 due: Sample scripts for Yioop to crawl password-restricted sites.
Week 13-14 (11/25/2011 - 12/08/2011): Work on the CS297 Report.
Week 15 (12/09/2011 - 12/14/2011): Deliverable 5 due: CS297 Report.

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Study how URL shortening service links are handled by Yioop and associate the short links correctly with the original page.

2. Perform experiments with Sphinx to index a newly created database.

3. Study various page ranking algorithms that can be used with URL shorteners.

4. Modify Yioop to crawl password restricted sites.

5. CS297 Report
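
To make the database-indexing deliverable concrete, the sketch below shows one simple way rows pulled from a database can be turned into a searchable index. It is only an illustration of the general idea, not how Sphinx or Yioop is implemented: the `articles` table, its contents, and the `build_inverted_index` function are hypothetical, and an in-memory SQLite database stands in for the newly created database.

```python
import re
import sqlite3
from collections import defaultdict


def build_inverted_index(conn, query):
    """Map each lowercase term to the set of row ids whose text contains it.

    The query is assumed to return (id, text) pairs.
    """
    index = defaultdict(set)
    for row_id, text in conn.execute(query):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(row_id)
    return index


# Hypothetical sample database created only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, body) VALUES (?, ?)",
    [(1, "Crawling the invisible web"),
     (2, "Indexing database content"),
     (3, "Database rows hidden from the web crawler")],
)

index = build_inverted_index(conn, "SELECT id, body FROM articles")
print(sorted(index["database"]))  # → [2, 3]
print(sorted(index["web"]))       # → [1, 3]
```

Each row is treated as a document and identified by its primary key, so a search hit can be traced back to the database record it came from.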

References:

[1] Accessing the Deep Web. Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang. ACM. 2007.
[2] http://en.wikipedia.org/wiki/Invisible_Web#Accessing, August 22, 2011.
[3] http://www.fridaytrafficreport.com/exploring-the-deep-web/, August 21, 2011.
[4] http://en.wikipedia.org/wiki/Sphinx_(search_engine), August 28, 2011.