CS297 Proposal

Enhancing the Queuing Process for Yioop's Scheduler

Gargi Sheguri (gargi.sheguri@sjsu.edu)

Advisor: Dr. Chris Pollett

Description:

Yioop is an open-source, PHP-based search engine. Yioop's crawling mechanism uses a set of distributed queue servers and fetchers to generate and rank search results. The working assumption is that webpages are indexed in order of importance, so that the most important webpages are fetched first. If a fetcher fails to retrieve a page, or runs slower than another fetcher and delays persisting its results, webpages may be indexed out of order. As a result, important documents could end up ranked lower in, or excluded from, the final index. The aim of this project is to modify the queuing process so that the expected order of search results is maintained.

Schedule:
Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Obtain the most recent copy of Yioop's source code; understand Yioop's ranking mechanisms, trace the relevant code, and understand how the queuing process works.
2. Fix one outstanding bug in Yioop.
3. Modify Yioop's queue server so that the index is created while maintaining the queuing order.
4. Pick one Yandex ranking signal and incorporate it into Yioop.
5. CS297 report.

References:

[1] S. T. Ahmed, C. Sparkman, H.-T. Lee, and D. Loguinov, "Around the web in six weeks: Documenting a large-scale crawl," in 2015 IEEE Conference on Computer Communications (INFOCOM), Hong Kong, China, 2015, pp. 1598-1606, doi: 10.1109/INFOCOM.2015.7218539.

[2] S. Severance, "Distributed web crawler architecture," US Patent Application 20110307467A1, Dec. 15, 2011. [Online]. Available: https://patents.google.com/patent/US20110307467A1

[3] B. B. Cambazoglu and R. Baeza-Yates, "Scalability Challenges in Web Search Engines," in Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 7, 2011, pp. 27-50, doi: 10.1007/978-3-642-20946-8_2.

[4] M. Marin, R. Paredes, and C. Bonacic, "High-performance priority queues for parallel crawlers," in Proceedings of the 10th ACM Workshop on Web Information and Data Management, 2008, pp. 47-54.
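The out-of-order indexing problem that motivates this proposal can be illustrated with a small sketch. This is not Yioop's actual code (Yioop is written in PHP, and its queue server and fetcher APIs differ); it is a minimal Python model, with hypothetical names, of a reordering buffer: each URL is tagged with the sequence number the scheduler assigned when it was queued, and fetcher results are held in a min-heap until the entire prefix of earlier results has arrived, so pages enter the index in scheduling order even when a fetcher lags.

```python
import heapq

class OrderedIndexer:
    """Buffer fetcher results so pages enter the index in the order
    the scheduler queued them, even if fetchers finish out of order.
    (Illustrative sketch only; not Yioop's actual API.)"""

    def __init__(self):
        self.next_seq = 0   # next schedule sequence number to index
        self.pending = []   # min-heap of (seq, page) fetch results

    def on_fetch_complete(self, seq, page):
        """Called when any fetcher returns a page; seq is the position
        the scheduler assigned when it queued the URL. Returns the
        pages that are now safe to add to the index, in order."""
        heapq.heappush(self.pending, (seq, page))
        flushed = []
        # Flush the longest in-order prefix of buffered results.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.pending)
            flushed.append(ready)
            self.next_seq += 1
        return flushed

indexer = OrderedIndexer()
indexer.on_fetch_complete(1, "b.html")         # arrives early: buffered
print(indexer.on_fetch_complete(0, "a.html"))  # → ['a.html', 'b.html']
```

A fast fetcher's result for a lower-priority page is simply held back until the slower fetcher working on the higher-priority page reports in, at the cost of buffering; a production queue server would also need a timeout so a failed fetch does not stall the index forever.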