CS298 Proposal

Enhancing the Queueing Process for Yioop's Scheduler

Gargi Sheguri (gargi.sheguri@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Dr. Ben Reed

Abstract:

Yioop is an open-source, PHP-based search engine. Its crawling mechanism uses a set of distributed queue servers and fetchers to generate and rank search results. The working assumption is that webpages are indexed in order of importance, so that the most important webpages are fetched first. The combined objective of the CS297 and CS298 projects is to enhance Yioop's current URL scheduling and queuing mechanism. In CS297, we improved Yioop's scheduling implementation by incorporating a Selective Repeat ARQ scheme that preserves the order of displayed results according to their calculated page ranks. This project focuses on the page ranking mechanism itself and aims to improve how URLs are selected while a schedule is formed, i.e., to improve the scoring methods used when creating a schedule. We will work towards this by experimenting with different ranking methods and comparing the resulting page ranks on a fixed set of test URLs.

CS297 Results

Researched the components involved in web search engines, the associated crawling, indexing, and ranking processes used by Yioop, and the relevant code base.
Fixed two outstanding bugs in Yioop: resolved the deprecation warnings raised after upgrading to PHP v8.2 and modified Yioop's wiki editor UI to use icons instead of buttons.
Implemented a Selective Repeat (sliding window) scheme between the QueueServer and the Fetcher to retain the URL crawl order in Yioop. The window size is set to the number of fetchers used in the current crawl, and each schedule is given an identifying serial number.
Researched a Yandex signal to be incorporated into Yioop to boost the ranking mechanism. Yandex's FI_NUM_SLASHES is a positive ranking factor that adjusts document scores based on how far a URL is from the root/homepage.
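
The Selective Repeat item above can be illustrated with a minimal sketch of a receiver-side reordering buffer. This is not Yioop's PHP implementation: the class and method names are hypothetical, and the only details taken from this proposal are that the window size equals the number of fetchers and that each schedule carries a serial number.

```python
class ScheduleWindow:
    """Receiver-side Selective Repeat buffer for crawl schedules (illustrative)."""

    def __init__(self, window_size):
        self.window_size = window_size  # assumed: number of fetchers in the crawl
        self.base = 0                   # lowest serial number not yet released
        self.buffer = {}                # serial number -> results held out of order

    def accept(self, serial, results):
        """Store one schedule's results; return those now releasable in order."""
        if serial < self.base or serial >= self.base + self.window_size:
            return []                   # duplicate or outside the window: ignore
        self.buffer[serial] = results
        released = []
        # Release the longest in-order run starting at the window base.
        while self.base in self.buffer:
            released.append(self.buffer.pop(self.base))
            self.base += 1
        return released
```

Results that arrive early are held until every lower serial number has arrived, so downstream processing sees schedules in the order they were issued even when fetchers finish out of order.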

Proposed Schedule

Week 1: Aug 22 - Aug 29	Submit CS298 proposal
Week 2 - 3: Aug 29 - Sept 12	Research and implement Yandex-inspired signals in Yioop
Week 4: Sept 12 - Sept 19	Research additional queuing mechanisms and how they can be implemented in Yioop
Week 5 - 8: Sept 19 - Oct 17	Implement and experiment with selected mechanisms to compare page ranking results
Week 9 - 12: Oct 17 - Nov 15	Implement page ranking based on the wiki page-view frequency table in Yioop's Scanner
Week 13 - 15: Nov 13 - Dec 5	Work on CS298 report/presentation

Key Deliverables:

Software
- Implement new search factors that rank a URL by its number of slashes and by whether it points to a Wikipedia page (inspired by Yandex's FI_NUM_SLASHES and FI_IS_WIKI signals, respectively) to improve Yioop's page ranking quality
- Experiment with different designs for the queue bundling mechanism and compare their effect on the resulting page ranking against the current implementation, for a fixed set of test URLs
- Rework Yioop's URL scanning logic to use the frequency table (if present) that orders wiki pages by view count, in order to improve the indexing of MediaWiki archives, which currently assigns page rank scores alphabetically
Report
- CS298 Report
- CS298 Presentation
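
The first software deliverable above could look roughly like the following sketch. The function names, the weights, and the way the two signals are combined are all assumptions made for illustration; they are neither Yandex's scoring nor Yioop's code. The sketch only shows the two underlying ideas: URLs closer to the root receive a larger bonus, and Wikipedia pages receive an extra one.

```python
from urllib.parse import urlsplit


def num_slashes(url):
    """Count path segments below the root; the homepage itself counts as 0."""
    path = urlsplit(url).path.strip("/")
    return 0 if not path else path.count("/") + 1


def is_wiki(url):
    """True when the host is wikipedia.org or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def url_bonus(url, depth_weight=1.0, wiki_weight=0.5):
    """Hypothetical additive bonus: shallow URLs and Wikipedia pages score higher."""
    depth_part = depth_weight / (1 + num_slashes(url))
    wiki_part = wiki_weight if is_wiki(url) else 0.0
    return depth_part + wiki_part
```

Keeping the two signals as separate functions would let each be switched on or off independently when comparing ranking results on the fixed set of test URLs.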

Innovations and Challenges

Yandex provides no official documentation on how its signals are implemented in the source code, so the purpose and application of its search factors have been pieced together from code comments and general knowledge. Many questions remain open about how and why certain factors are considered, whether they are actually used to produce Yandex's search results, and how their associated scores are assigned.
A major challenge of this project is determining which queuing mechanisms to consider during experimentation, how they can be implemented and tested efficiently, and how to modify Yioop's design to replace the existing page ranking implementation. We also need to determine which metrics to use when comparing performance.
Indexing wiki archives is not straightforward: it requires parsing the entire MediaWiki entry for each page (entries are ordered alphabetically) to find the page's number of views, and then using that number to find the page's offset in the frequency table list.
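
The offset lookup described above might be sketched as follows. The representation of the frequency table is an assumption: here it is simply a list of view counts sorted in descending order, and the score formula is hypothetical. The actual table format used by Yioop may differ.

```python
from bisect import bisect_left


def table_offset(view_counts_desc, views):
    """Offset of the first entry equal to `views` in a descending list, or None."""
    # bisect expects ascending order, so search on the negated counts.
    negated = [-v for v in view_counts_desc]
    i = bisect_left(negated, -views)
    return i if i < len(negated) and negated[i] == -views else None


def rank_score(offset, table_size):
    """Hypothetical score in (0, 1]: the most viewed page gets 1.0."""
    return (table_size - offset) / table_size
```

With a score derived from the offset, a frequently viewed page would rank above a rarely viewed one regardless of where its title falls alphabetically.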

References:

[1] A. Jain, R. Sharma, G. Dixit and V. Tomar, "Page Ranking Algorithms in Web Mining, Limitations of Existing Methods and a New Method for Indexing Web Pages," 2013 International Conference on Communication Systems and Network Technologies, Gwalior, India, 2013, pp. 640-645, doi: 10.1109/CSNT.2013.137.

[2] M. P. Selvan, A. C. Shekar, D. R. Babu and A. K. Teja, "Efficient ranking based on web page importance and personalized search," 2015 International Conference on Communications and Signal Processing (ICCSP), Melmaruvathur, India, 2015, pp. 1093-1097, doi: 10.1109/ICCSP.2015.7322671.

[3] T. Sen, D. K. Chaudhary and T. Choudhury, "Modified Page Rank Algorithm: Efficient Version of Simple Page Rank with Time, Navigation and Synonym Factor," 2017 3rd International Conference on Computational Intelligence and Networks (CINE), Odisha, India, 2017, pp. 27-32, doi: 10.1109/CINE.2017.24.

[4] D. Gupta and D. Singh, "User preference based page ranking algorithm," 2016 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 2016, pp. 166-171, doi: 10.1109/CCAA.2016.7813711.

[5] M. King, "Yandex scrapes Google and other SEO learnings from the source code leak," Search Engine Land, Jan. 30, 2023. https://searchengineland.com/yandex-leak-learnings-392393