CS298 Proposal
Robust Cache System for Yioop
Rushikesh Padia (rushikeshlalit.padia@sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Ben Reed, Batul Merchant
Abstract:
According to recent studies, the average internet user expects search results within 0.5 to 2 seconds.
Google has likewise established guidelines that the response time for a search query should be under 200 milliseconds.
Achieving such ambitious goals requires an efficient caching mechanism.
In web search engines, caching is implemented at multiple layers to improve performance and response time.
Caching query results has the highest impact on response time, since serving a cached result avoids reprocessing the query.
It also reduces the load on web servers. Yioop is one such open-source web search engine
that implements result caching. The current implementation uses a single dynamic cache based on Marker's algorithm.
A single dynamic cache can capture short-term trends in queries, but it fails to capture long-term trends
and consistently popular queries. To capture such trends, Static-Dynamic (SD) caches are commonly used.
In this project, we are implementing a Static-Topic-Dynamic (STD) cache in Yioop, which adds a topic-based layer over the SD cache.
In an STD cache, a fraction of the cache is allocated to this layer, where cache entries are separated into different topics, such as weather and education.
Search results for a specific topic are stored in that topic's cache section. This will allow Yioop
to use cache space effectively by adapting to the temporal locality of different topics.
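The layered lookup and admission just described can be sketched as follows. This is a minimal illustration in Python (Yioop itself is written in PHP); the class, method, and parameter names are our own, not Yioop's, and the topic and dynamic layers are shown with simple LRU eviction for concreteness.

```python
from collections import OrderedDict

class STDCache:
    """Sketch of a Static-Topic-Dynamic cache: a read-only static layer for
    all-time popular queries, one partition per topic, and a dynamic LRU
    layer for everything else. Illustrative only, not Yioop's API."""

    def __init__(self, topics, topic_capacity, dynamic_capacity):
        self.static = {}                                   # filled offline with popular queries
        self.topic = {t: OrderedDict() for t in topics}    # one LRU partition per topic
        self.topic_capacity = topic_capacity
        self.dynamic = OrderedDict()                       # LRU over remaining queries
        self.dynamic_capacity = dynamic_capacity

    def get(self, query, topic):
        if query in self.static:                           # 1. static layer
            return self.static[query]
        part = self.topic.get(topic)
        if part is not None and query in part:             # 2. topic layer
            part.move_to_end(query)                        # refresh recency
            return part[query]
        if query in self.dynamic:                          # 3. dynamic layer
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        return None

    def put(self, query, topic, results):
        part = self.topic.get(topic)
        if part is not None:                               # admit into the query's topic partition
            part[query] = results
            part.move_to_end(query)
            if len(part) > self.topic_capacity:
                part.popitem(last=False)                   # evict that topic's LRU entry
        else:                                              # unknown topic: fall back to dynamic layer
            self.dynamic[query] = results
            self.dynamic.move_to_end(query)
            if len(self.dynamic) > self.dynamic_capacity:
                self.dynamic.popitem(last=False)
```

Because each topic partition evicts only within itself, a burst of queries on one topic cannot flush cached results for the others, which is how the design exploits per-topic temporal locality.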
CS297 Results
- Implemented a CacheRefresh Media Job to periodically populate the cache.
- Implemented a dynamic cache with a machine-learning-based admission and eviction policy.
The algorithm achieved a hit rate of 28.33%, while its counterpart LRU achieved 52.60%, for
a cache size of 100 and a query log of 4,200 queries.
- Implemented a Static-Topic-Dynamic cache using LDA features. For a cache size of 100 and a query log
of 10,000 queries, the hit rate was 43.16%, slightly higher than LRU (42.33%) and slightly lower than SDC (43.35%).
- Implemented a Static-Semi-Static-Dynamic cache. For a cache size of 300 and a query log of 1.5 million queries,
the hit rate was 55.76%, slightly lower than SDC's 56.65%.
Proposed Schedule
Week 1:
24 Jan - 31 Jan | Plan and decide on deliverables |
Week 2:
31 Jan - 07 Feb | Create proposal |
Week 3:
07 Feb - 14 Feb | Create test dataset and cache evaluation module |
Week 4:
14 Feb - 21 Feb | Create test dataset and cache evaluation module |
Week 5:
21 Feb - 28 Feb | Get performance reading of current caching algorithm |
Week 6:
28 Feb - 07 Mar | Implement and train LDA model using enriched search queries |
Week 7:
07 Mar - 14 Mar | Implement and train LDA model using enriched search queries |
Week 8:
14 Mar - 21 Mar | Implement and train LDA model using enriched search queries |
Week 9:
21 Mar - 28 Mar | Implement Static-Topic-Dynamic Cache |
Week 10:
28 Mar - 04 Apr | Implement Static-Topic-Dynamic Cache |
Week 11:
04 Apr - 11 Apr | Fine tune cache with different configurations. Start working on report |
Week 12:
11 Apr - 18 Apr | Evaluate and conclude with performance results. Continue working on report |
Week 13:
18 Apr - 25 Apr | Complete report and get it reviewed by Prof. Pollett |
Week 14:
25 Apr - 02 May | Finish report and presentation and get it reviewed by the committee |
Week 15:
02 May - 09 May | Review and fix report and presentation |
Week 16:
09 May - 16 May | Practice presentation and demo |
Week 17:
16 May - 23 May | Wrap up all work |
Key Deliverables:
- Software
- Create a test dataset and a module to evaluate the performance of the cache in Yioop. Provide performance
results for the caching algorithm currently implemented in Yioop
- Implement a Latent Dirichlet Allocation (LDA) model and train it on search queries.
LDA is an unsupervised topic model and will be used to assign topics to queries.
- Implement the Static-Topic-Dynamic cache in Yioop. Compare its performance under different configurations
- Report
- CS 298 Report - a report fully describing the implementation of the STD caching algorithm, dataset preparation for LDA,
major design decisions, and performance evaluation against benchmark algorithms
- CS 298 Presentation - a presentation containing an overview of the complete project and the results of the implementations
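The first deliverable, the cache evaluation module, amounts to replaying a query log against a cache and counting hits, which is how hit rates like those reported for CS297 are obtained. The sketch below shows this harness in Python (the actual module will live in Yioop's PHP codebase); the `LRUCache` baseline and all names are illustrative assumptions, not Yioop's API.

```python
from collections import OrderedDict

class LRUCache:
    """Baseline LRU cache, included only so the harness is self-contained."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, query, topic=None):
        if query in self.store:
            self.store.move_to_end(query)   # refresh recency on a hit
            return self.store[query]
        return None

    def put(self, query, topic, results):
        self.store[query] = results
        self.store.move_to_end(query)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

def hit_rate(cache, query_log, assign_topic=lambda q: None):
    """Replay a query log against a cache and return the fraction of hits.

    `cache` needs get(query, topic) / put(query, topic, results);
    `assign_topic` maps a query to a topic label (e.g. via the LDA model)."""
    hits = 0
    for query in query_log:
        topic = assign_topic(query)
        if cache.get(query, topic) is not None:
            hits += 1
        else:
            cache.put(query, topic, "results for " + query)
    return hits / len(query_log)
```

Because the harness only assumes a get/put interface, the same query log can be replayed against LRU, SDC, and STD instances to produce directly comparable hit rates.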
Innovations and Challenges
- To the best of our knowledge, there is no publicly known open-source web search engine that utilizes a Static-Topic-Dynamic cache.
The paper proposing it has been cited by three other papers, none of which mentions an implementation.
- The major challenge of the project is training the LDA model. Since search queries are usually short, they
need to be enriched with summary texts from search results before they can be effectively categorized into topics.
- The existing cache design is tightly coupled to a single algorithm. The new design will allow switching between different
caching algorithms. The STD cache itself uses multiple cache instances and a topic model. With the new design,
the different components of the cache will be loosely coupled to provide flexibility.
- The STD cache partitions have to be fine-tuned to use cache space efficiently. Performance results for
different configurations will be evaluated and compared against different benchmark algorithms.
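The enrichment step behind the LDA challenge above can be sketched simply: each short query is expanded into a longer pseudo-document by concatenating it with the summaries of its top search results before topic modeling. The function and field names below are illustrative assumptions, not Yioop's.

```python
def enrich_query(query, search_results, max_snippets=3):
    """Expand a short query with summary snippets from its top results.

    LDA performs poorly on very short documents, so the document fed to
    the topic model is the query text plus the summaries of up to
    `max_snippets` of its top-ranked results. Illustrative names only."""
    snippets = [result["summary"] for result in search_results[:max_snippets]]
    return " ".join([query] + snippets)
```

The enriched strings, rather than the raw queries, would then form the training corpus for the LDA model, giving it enough co-occurring terms to separate topics such as weather and education.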
References:
[1] H. Ma, O. Tao, C. Zhao, P. Li, and L. Wang, "Impact of replacement policies on static-dynamic query results cache in web search engines," in 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), 2017, pp. 137-139. doi: 10.1109/ISI.2017.8004890.
[2] R. Solar, V. Gil-Costa, and M. Marin, "Evaluation of Static/Dynamic Cache for Similarity Search Engines," in SOFSEM 2016: Theory and Practice of Computer Science, 2016, pp. 615-627.
[3] R. Blanco, E. Bortnikov, F. Junqueira, R. Lempel, L. Telloli, and H. Zaragoza, "Caching Search Engine Results over Incremental Indices," in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010, pp. 82-89. doi: 10.1145/1835449.1835466.
[4] T. Trinh, D. Wu, and J. Z. Huang, "C3C: A New Static Content-Based Three-Level Web Cache," IEEE Access, vol. 7, pp. 11796-11808, 2019, doi: 10.1109/ACCESS.2019.2892761.
[5] T. Kucukyilmaz, B. B. Cambazoglu, C. Aykanat, and R. Baeza-Yates, "A machine learning approach for result caching in web search engines," Information Processing & Management, vol. 53, no. 4, pp. 834-850, 2017, doi: 10.1016/j.ipm.2017.02.006.
[6] I. Mele, N. Tonellotto, O. Frieder, and R. Perego, "Topical result caching in web search engines," Information Processing & Management, vol. 57, no. 3, p. 102193, 2020, doi: 10.1016/j.ipm.2019.102193.
[7] M. Catena and N. Tonellotto, "Energy-Efficient Query Processing in Web Search Engines," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 7, pp. 1412-1425, 2017, doi: 10.1109/TKDE.2017.2681279.
[8] B. Cambazoglu and R. Baeza-Yates, "Scalability Challenges in Web Search Engines," in Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 7, 2011, pp. 27-50. doi: 10.1007/978-3-642-20946-8_2.
[9] R. Ozcan, I. S. Altingovde, and A. Ulusoy, "Cost-Aware Strategies for Query Result Caching in Web Search Engines," ACM Trans. Web, vol. 5, no. 2, May 2011, doi: 10.1145/1961659.1961663.