CS297 Proposal
Enhancing the queuing process for Yioop's scheduler
Gargi Sheguri (gargi.sheguri@sjsu.edu)
Advisor: Dr. Chris Pollett
Description:
Yioop is an open-source, PHP-based search engine. Yioop's crawling mechanism uses a set of distributed queue servers and fetchers to generate (and rank) search results. The working assumption is that webpages are indexed in order of importance, so that the most important webpages are fetched first. If a fetcher fails to retrieve its pages, or runs slower than another fetcher and delays persisting its results, webpages may be indexed out of order. As a result, important documents could be ranked lower in the final index or excluded from it altogether. The aim of this project is to modify the queuing process so that the expected order of search results is maintained.
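To make the ordering problem concrete, the sketch below shows one possible way to release fetch results in the order in which they were scheduled, even when fetchers report back out of order or fail. It is only an illustration under stated assumptions, not Yioop's implementation: the class `ReorderBuffer`, the per-batch sequence numbers, and the convention that `None` marks a failed fetch are all hypothetical.

```python
class ReorderBuffer:
    """Hypothetical buffer that releases fetch results in schedule order."""

    def __init__(self):
        self._pending = {}   # sequence number -> result awaiting release
        self._next_seq = 0   # lowest sequence number not yet released

    def add(self, seq, result):
        """Record the result for batch `seq` (None is assumed to mean the
        fetch failed) and return all results that can now be indexed
        without violating the scheduled order."""
        self._pending[seq] = result
        ready = []
        # Release a contiguous run of results starting at the next
        # expected sequence number; stop at the first gap.
        while self._next_seq in self._pending:
            item = self._pending.pop(self._next_seq)
            if item is not None:   # skip failed fetches instead of stalling
                ready.append(item)
            self._next_seq += 1
        return ready


buf = ReorderBuffer()
first = buf.add(1, "page-B")    # arrives early, held back
second = buf.add(2, None)       # failed fetch, also held back
third = buf.add(0, "page-A")    # fills the gap, releases the run
```

In this sketch a slow fetcher delays indexing of later batches rather than reordering them, which trades latency for order. Whether that trade-off suits Yioop's queue server is one of the questions this project would need to examine.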
Schedule:
Week 1 | Jan 31 - Feb 7 | Finalize project topic and deliverables
Week 2 | Feb 7 - Feb 14 | Work on Deliverable #1: Understand Yioop's ranking mechanisms and queuing process, related files in codebase
Week 3 | Feb 14 - Feb 21 | Complete and summarize Deliverable #1, read [1]
Week 4 | Feb 21 - Feb 28 | Start working on Deliverable #2: Understand and fix Yioop bug
Week 5 | Feb 28 - Mar 7 | Work on Deliverable #2, read [2] and [3]
Week 6 | Mar 7 - Mar 14 | Work on Deliverable #2
Week 7 | Mar 14 - Mar 21 | Complete and summarize Deliverable #2
Week 8 | Mar 21 - Mar 28 | Start working on Deliverable #3: Modifying Yioop's queuing process to retain expected index order
Week 9 | Mar 28 - Apr 4 | Work on Deliverable #3
Week 10 | Apr 4 - Apr 11 | Work on Deliverable #3
Week 11 | Apr 11 - Apr 18 | Work on Deliverable #3
Week 12 | Apr 18 - Apr 25 | Complete and summarize Deliverable #3
Week 13 | Apr 25 - May 2 | Start working on Deliverable #4: Study Yandex and incorporate one signal into Yioop, read [4]
Week 14 | May 2 - May 9 | Complete and summarize Deliverable #4, start working on Deliverable #5: CS297 report
Week 15 | May 9 - May 16 | Work on final report
Deliverables:
The full project will be done when CS298 is completed. The following will be done by the end of CS297:
1. Obtain the most recent copy of Yioop's source code, understand Yioop's ranking mechanisms, trace the relevant code, and understand how the queuing process works.
2. Fix one outstanding bug in Yioop.
3. Modify Yioop's queue server so that the index is built in the order in which webpages were queued.
4. Pick one Yandex signal and incorporate it into Yioop.
5. CS297 report.
References:
[1] S. T. Ahmed, C. Sparkman, H.-T. Lee, and D. Loguinov, "Around the web in six weeks: Documenting a large-scale crawl," 2015 IEEE Conference on Computer Communications (INFOCOM), Hong Kong, China, 2015, pp. 1598-1606, doi: 10.1109/INFOCOM.2015.7218539.
[2] S. Severance, "Distributed web crawler architecture," US20110307467A1, Dec. 15, 2011. [Online]. Available: https://patents.google.com/patent/US20110307467A1
[3] B. Cambazoglu and R. Baeza-Yates, "Scalability Challenges in Web Search Engines," in Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 7, 2011, pp. 27-50. doi: 10.1007/978-3-642-20946-8_2.
[4] M. Marin, R. Paredes, and C. Bonacic, "High-performance priority queues for parallel crawlers," in Proceedings of the 10th ACM Workshop on Web Information and Data Management, 2008, pp. 47-54.