CS297 Proposal

Enhancing the Queuing Process for Yioop's Scheduler

Gargi Sheguri (gargi.sheguri@sjsu.edu)

Advisor: Dr. Chris Pollett

Description:

Yioop is an open-source, PHP-based search engine. Yioop's crawling mechanism uses a set of distributed queue servers and fetchers to generate and rank search results. The working assumption is that webpages are indexed in order of importance, so that the most important webpages are fetched first. If a fetcher fails to retrieve a page, or runs slower than another fetcher and delays persisting its results, webpages may be indexed out of order. As a result, important documents could end up ranked lower in, or excluded from, the final index. The aim of this project is to modify the queuing process so that the expected order of search results is maintained.

Schedule:
Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Obtain the most recent copy of Yioop's source code; understand Yioop's ranking mechanisms, trace the relevant code, and understand how the queuing process works.
2. Fix one outstanding bug in Yioop.
3. Modify Yioop's queue server so that the index is created while maintaining the queuing order.
4. Pick one Yandex ranking signal and incorporate it into Yioop.
5. CS297 report.

References:

[1] S. T. Ahmed, C. Sparkman, H.-T. Lee, and D. Loguinov, "Around the web in six weeks: Documenting a large-scale crawl," in 2015 IEEE Conference on Computer Communications (INFOCOM), Hong Kong, China, 2015, pp. 1598-1606, doi: 10.1109/INFOCOM.2015.7218539.

[2] S. Severance, "Distributed web crawler architecture," US Patent Application 20110307467A1, Dec. 15, 2011. [Online]. Available: https://patents.google.com/patent/US20110307467A1

[3] B. B. Cambazoglu and R. Baeza-Yates, "Scalability Challenges in Web Search Engines," in Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 7, 2011, pp. 27-50, doi: 10.1007/978-3-642-20946-8_2.

[4] M. Marin, R. Paredes, and C. Bonacic, "High-performance priority queues for parallel crawlers," in Proceedings of the 10th ACM Workshop on Web Information and Data Management, 2008, pp. 47-54.
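The out-of-order indexing problem that motivates this proposal can be illustrated with a small sketch. This is not Yioop's actual code (Yioop is written in PHP, and its queue server and fetcher APIs differ); it is a minimal Python model, with hypothetical names, of a reordering buffer: each URL is tagged with the sequence number the scheduler assigned when it was queued, and fetcher results are held in a min-heap until the entire prefix of earlier results has arrived, so pages enter the index in scheduling order even when a fetcher lags.

```python
import heapq

class OrderedIndexer:
    """Buffer fetcher results so pages enter the index in the order
    the scheduler queued them, even if fetchers finish out of order.
    (Illustrative sketch only; not Yioop's actual API.)"""

    def __init__(self):
        self.next_seq = 0   # next schedule sequence number to index
        self.pending = []   # min-heap of (seq, page) fetch results

    def on_fetch_complete(self, seq, page):
        """Called when any fetcher returns a page; seq is the position
        the scheduler assigned when it queued the URL. Returns the
        pages that are now safe to add to the index, in order."""
        heapq.heappush(self.pending, (seq, page))
        flushed = []
        # Flush the longest in-order prefix of buffered results.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.pending)
            flushed.append(ready)
            self.next_seq += 1
        return flushed

indexer = OrderedIndexer()
indexer.on_fetch_complete(1, "b.html")         # arrives early: buffered
print(indexer.on_fetch_complete(0, "a.html"))  # → ['a.html', 'b.html']
```

A fast fetcher's result for a lower-priority page is simply held back until the slower fetcher working on the higher-priority page reports in, at the cost of buffering; a production queue server would also need a timeout so a failed fetch does not stall the index forever.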