CS297 Proposal

Enhancing the queuing process for Yioop's scheduler

Gargi Sheguri (gargi.sheguri@sjsu.edu)

Advisor: Dr. Chris Pollett

Description:

Yioop is an open-source, PHP-based search engine. Its crawling mechanism uses a set of distributed queue servers and fetchers to generate (and rank) search results. The working assumption is that webpages are indexed in order of importance, so that the most important webpages are fetched first. If a fetcher fails to retrieve its pages, or runs slower than another fetcher and therefore persists its results late, webpages may be indexed out of order. As a result, important documents could appear lower in the final index or be excluded from it altogether. The aim of this project is to modify the queuing process so that the expected order of search results is maintained.
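The ordering problem described above can be illustrated with a small reorder buffer. This is only a sketch in Python, not Yioop's code (Yioop is written in PHP) and not the design this project will necessarily adopt. It assumes each scheduled batch carries a unique sequence number starting at 0; the class name `OrderedResultBuffer` and its `add` method are hypothetical. Batches that finish early are held back until every earlier batch has either arrived or been reported as failed.

```python
class OrderedResultBuffer:
    """Hypothetical helper: release fetched batches in scheduling order."""

    def __init__(self):
        self._next_seq = 0   # sequence number the indexer is waiting for
        self._pending = {}   # seq -> pages, for batches that arrived early

    def add(self, seq, pages):
        """Record a finished batch and return the batches now ready to index.

        `pages` is None when the fetch failed. In this sketch a failed
        batch is simply skipped so that later batches are not blocked;
        a real scheduler might requeue it instead.
        """
        self._pending[seq] = pages
        ready = []
        # Drain every batch that is next in line, stopping at the first gap.
        while self._next_seq in self._pending:
            batch = self._pending.pop(self._next_seq)
            if batch is not None:
                ready.append(batch)
            self._next_seq += 1
        return ready


buf = OrderedResultBuffer()
indexed = []
# Batches were scheduled as 0, 1, 2, but the fetchers finish as 1, 2, 0.
for seq, pages in [(1, ["b.example"]), (2, ["c.example"]), (0, ["a.example"])]:
    for batch in buf.add(seq, pages):
        indexed.extend(batch)
print(indexed)  # ['a.example', 'b.example', 'c.example']
```

The trade-off in this approach is latency: one slow fetcher delays indexing of every later batch, so a practical version would likely need a timeout after which a missing batch is treated as failed.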

Schedule:

Week 1 (Jan 31 - Feb 7): Finalize project topic and deliverables
Week 2 (Feb 7 - Feb 14): Work on Deliverable #1: Understand Yioop's ranking mechanisms and queuing process, related files in codebase
Week 3 (Feb 14 - Feb 21): Complete and summarize Deliverable #1, read [1]
Week 4 (Feb 21 - Feb 28): Start working on Deliverable #2: Understand and fix Yioop bug
Week 5 (Feb 28 - Mar 7): Work on Deliverable #2, read [2] and [3]
Week 6 (Mar 7 - Mar 14): Work on Deliverable #2
Week 7 (Mar 14 - Mar 21): Complete and summarize Deliverable #2
Week 8 (Mar 21 - Mar 28): Start working on Deliverable #3: Modifying Yioop's queuing process to retain expected index order
Week 9 (Mar 28 - Apr 4): Work on Deliverable #3
Week 10 (Apr 4 - Apr 11): Work on Deliverable #3
Week 11 (Apr 11 - Apr 18): Work on Deliverable #3
Week 12 (Apr 18 - Apr 25): Complete and summarize Deliverable #3
Week 13 (Apr 25 - May 2): Start working on Deliverable #4: Study Yandex and incorporate one signal into Yioop, read [4]
Week 14 (May 2 - May 9): Complete and summarize Deliverable #4, start working on Deliverable #5: CS 297 Report
Week 15 (May 9 - May 16): Work on final report

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Obtain the most recent copy of Yioop's source code: understand Yioop's ranking mechanisms, trace relevant code, and understand how the queuing process works.

2. Fix one outstanding bug in Yioop.

3. Modify Yioop's queue server so that the index is built in the same order in which pages were queued.

4. Pick one Yandex signal and incorporate that into Yioop.

5. CS 297 report.

References:

[1] S. T. Ahmed, C. Sparkman, H.-T. Lee, and D. Loguinov, "Around the web in six weeks: Documenting a large-scale crawl," 2015 IEEE Conference on Computer Communications (INFOCOM), Hong Kong, China, 2015, pp. 1598-1606, doi: 10.1109/INFOCOM.2015.7218539.

[2] S. Severance, "Distributed web crawler architecture," US20110307467A1, Dec. 15, 2011. [Online]. Available: https://patents.google.com/patent/US20110307467A1

[3] B. Cambazoglu and R. Baeza-Yates, "Scalability Challenges in Web Search Engines," in Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 7, 2011, pp. 27-50. doi: 10.1007/978-3-642-20946-8_2.

[4] M. Marin, R. Paredes, and C. Bonacic, "High-performance priority queues for parallel crawlers," in Proceedings of the 10th ACM Workshop on Web Information and Data Management, 2008, pp. 47-54.