CS298 Proposal

Yioop! Full Historical Indexing in Cache Navigation

Akshat Kukreti (akshatkukreti@gmail.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Chris Tseng, Dr. Soon Tee Teoh

Abstract:

Search engines often maintain a cache with copies of the web pages downloaded by their crawlers. A link to the cached version of a web page is displayed along with the query results. The aim of this project is to add new cache-related features to the open source search engine Yioop!. The first aim is to develop a feature that enables users to navigate through the entire history of cached versions of a particular web page. This feature will be similar to the Internet Archive in that it will give the user access to all snapshots of a query result. As the content of the web changes, search engines must update their indexes accordingly. Indexing takes time, and if it is done only after a long crawl, the search engine index may become stale, causing the cache to become stale as well. Commercial search engines like Google update parts of their index more frequently so that those parts stay fresh. The second aim is to develop a feature that incrementally updates the search index along with the associated cache. One way to check whether a web resource has changed is to use Entity Tags (ETags). An Entity Tag is an identifier that a web server associates with a specific version of a web resource. The second feature will use Entity Tags for index updates and cache invalidation. The third aim is to come up with at least two algorithms for using ETags in the crawl process. The algorithms will be implemented, and their impact on crawl speed and bandwidth (size of pages downloaded) will be compared to determine which approach is better.
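As a rough illustration of how Entity Tags could drive cache invalidation, the sketch below models one conditional request cycle: the crawler sends the stored ETag in an If-None-Match header, and a 304 Not Modified response means the cached copy can be kept without downloading the page again. The names CachedPage, conditional_headers, and revalidate are hypothetical and are not part of Yioop!; the sketch is in Python for brevity and performs no network I/O.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class CachedPage:
    """A cached copy of a page plus the ETag the server sent with it."""
    body: bytes
    etag: Optional[str]


def conditional_headers(cached: Optional[CachedPage]) -> Dict[str, str]:
    """Build request headers; If-None-Match is sent only if an ETag is stored."""
    if cached is not None and cached.etag:
        return {"If-None-Match": cached.etag}
    return {}


def revalidate(
    cached: Optional[CachedPage],
    status: int,
    response_etag: Optional[str],
    response_body: bytes,
) -> Tuple[CachedPage, bool]:
    """Return (page_to_keep, changed) after a conditional GET.

    304 Not Modified: the stored copy is still valid and no body was sent.
    200 OK: the resource changed (or was never cached); replace the copy.
    """
    if status == 304 and cached is not None:
        return cached, False
    if status == 200:
        return CachedPage(body=response_body, etag=response_etag), True
    raise ValueError(f"unexpected status {status}")
```

Only a 200 response would require re-indexing the page and replacing its cached copy; a 304 response costs a round trip but no page download, which is where any bandwidth saving would come from.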

CS297 Results

  • Modified Yioop!'s cache request and output mechanism so that if no cached version of a web page is found for the specified timestamp, a search is done for the timestamp nearest in time to the specified one that does contain a cached version of the page.
  • Modified the links present in cached versions of web pages so that they use the lookup method described in the first result.
  • Developed a UI that shows links to all cached versions of a web page when showing a cached version.
  • Performed experiments with ETags using PHP and cURL. Noted results obtained by making requests with different ETag headers and values.
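The nearest-timestamp lookup from the first result can be sketched as follows. This is a minimal illustration, not Yioop!'s actual code: it assumes that each crawl is identified by an integer timestamp, that the list holds only the timestamps with a cached copy of the page, and that the list is sorted in ascending order. The function name nearest_cached_timestamp is hypothetical.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def nearest_cached_timestamp(
    timestamps: Sequence[int], requested: int
) -> Optional[int]:
    """Return the entry of sorted `timestamps` closest to `requested`.

    `timestamps` holds only crawls that contain a cached copy of the page.
    Ties go to the earlier timestamp; an empty list yields None.
    """
    if not timestamps:
        return None
    i = bisect_left(timestamps, requested)
    if i == 0:
        return timestamps[0]          # requested is before every crawl
    if i == len(timestamps):
        return timestamps[-1]         # requested is after every crawl
    before, after = timestamps[i - 1], timestamps[i]
    return before if requested - before <= after - requested else after
```

An exact match returns the requested timestamp itself, so the same lookup serves both the case where a cached copy exists and the fallback case.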

Proposed Schedule

Week 1 (Jan 29 - Feb 5, 2013): Discuss project in detail with Advisor and write CS298 proposal
Week 2 (Feb 5 - Feb 12, 2013): Work on History feature for navigating through cached web page history
Week 3 (Feb 12 - Feb 19, 2013): Deliverable 1: History feature due
Week 4 (Feb 19 - Feb 26, 2013): Research ways for saving Entity Tags
Week 5 (Feb 26 - Mar 5, 2013): Read and Understand Yioop!'s Queue server program code
Week 6 (Mar 5 - Mar 12, 2013): Work on two algorithms that use Entity Tags for cache invalidation
Week 7 (Mar 12 - Mar 19, 2013): Work on two algorithms that use Entity Tags for cache invalidation
Week 8 (Mar 19 - Mar 26, 2013): Work on algorithms that use Entity Tags for cache invalidation
Week 9 (Mar 26 - Apr 2, 2013): Deliverable 2: Test Implementation of Algorithms on cache invalidation due
Week 10 (Apr 2 - Apr 9, 2013): Experiment with Algorithms and report impact on crawl speed and bandwidth
Week 11 (Apr 9 - Apr 16, 2013): Experiment with Algorithms and report impact on crawl speed and bandwidth
Week 12 (Apr 16 - Apr 23, 2013): Deliverable 3: Report comparing results from Algorithms due
Week 13 (Apr 23 - Apr 30, 2013): Work on CS298 report
Week 14 (Apr 30 - May 7, 2013): Submit first draft of CS298 report to Advisor and Committee
Week 15 (May 7 - May 14, 2013): Submit final draft of CS298 report to Advisor and Committee
Week 16 (May 14 - May 21, 2013): Defense

Key Deliverables:

  • Software
    • Implementation of an algorithm in Yioop! for navigating through the entire set of web page caches.
    • Implementation of a history feature that enables users to see the history of a cached web page.
    • Implementation in Yioop! of two algorithms that use ETags for cache invalidation.
  • Report
    • CS298 Report
    • Software Documentation
    • Comparison of results obtained from the implemented algorithms in terms of crawl speed and bandwidth usage.

Innovations and Challenges

  • Implementing a feature for navigating through the history of cached pages is innovative, as most commercial search engines show only the latest cached version of a web page. The feature will act as an archive of cached pages.
  • Software that uses Entity Tags is often proprietary. In this project, Entity Tags will be used in an open source system (Yioop!).
  • Managing a large number of Entity Tags without slowing down the crawl speed is a challenge.

References:

[Buttcher 2010] Information Retrieval: Implementing and Evaluating Search Engines. Stefan Buttcher, Charles L.A. Clarke, Gordon V. Cormack. The MIT Press. 2010

[Schrenk 2012] Webbots, Spiders and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL. Michael Schrenk. No Starch Press. 2012

[Enbody 2012] Dynamic Hashing Schemes. R.J. Enbody, H.C. Du. http://www.dcc.unicamp.br/~celio/mc326/hashing/dynamic-hashing-enbody.pdf. 2012

[Blanco 2010] Caching Search Engine Results over Incremental Indices. Roi Blanco, Edward Bortnikov, Flavio P. Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza. http://research.yahoo.com/pub/3234. 2010