CS298 Proposal
Robust Cache System for Yioop
Rushikesh Padia (rushikeshlalit.padia@sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Ben Reed, Batul Merchant
Abstract:
According to recent studies, the average internet user expects search results within 0.5 to 2 seconds.
Google has likewise established guidelines that the response time for a search query should be under 200 milliseconds.
Achieving such ambitious goals requires an efficient caching mechanism.
In web search engines, caching is implemented at multiple layers to improve performance and response time.
Caching query results has the highest impact on response time, since serving a cached result avoids reprocessing the query.
It also reduces the load on web servers. Yioop is one such open-source web search engine
that implements result caching. The current implementation uses a single dynamic cache based on Marker's algorithm.
A single dynamic cache can capture short-term trends in queries, but it fails to capture long-term trends
and consistently popular queries. To capture such trends, Static-Dynamic (SD) caches are commonly used.
In this project, we are implementing a Static-Topic-Dynamic (STD) cache in Yioop, which adds a topic-based layer over the SD cache.
In an STD cache, a fraction of the cache is allocated to this layer, where cache entries are separated into different topics, such as weather and education.
Search results for a specific topic are stored in that topic's cache section. This will allow Yioop
to use cache space effectively by adapting to the temporal locality of different topics.
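The layered lookup and admission just described can be sketched as follows. This is a minimal illustration in Python (Yioop itself is written in PHP); the class, method, and parameter names are our own, not Yioop's, and the topic and dynamic layers are shown with simple LRU eviction for concreteness.

```python
from collections import OrderedDict

class STDCache:
    """Sketch of a Static-Topic-Dynamic cache: a read-only static layer for
    all-time popular queries, one partition per topic, and a dynamic LRU
    layer for everything else. Illustrative only, not Yioop's API."""

    def __init__(self, topics, topic_capacity, dynamic_capacity):
        self.static = {}                                   # filled offline with popular queries
        self.topic = {t: OrderedDict() for t in topics}    # one LRU partition per topic
        self.topic_capacity = topic_capacity
        self.dynamic = OrderedDict()                       # LRU over remaining queries
        self.dynamic_capacity = dynamic_capacity

    def get(self, query, topic):
        if query in self.static:                           # 1. static layer
            return self.static[query]
        part = self.topic.get(topic)
        if part is not None and query in part:             # 2. topic layer
            part.move_to_end(query)                        # refresh recency
            return part[query]
        if query in self.dynamic:                          # 3. dynamic layer
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        return None

    def put(self, query, topic, results):
        part = self.topic.get(topic)
        if part is not None:                               # admit into the query's topic partition
            part[query] = results
            part.move_to_end(query)
            if len(part) > self.topic_capacity:
                part.popitem(last=False)                   # evict that topic's LRU entry
        else:                                              # unknown topic: fall back to dynamic layer
            self.dynamic[query] = results
            self.dynamic.move_to_end(query)
            if len(self.dynamic) > self.dynamic_capacity:
                self.dynamic.popitem(last=False)
```

Because each topic partition evicts only within itself, a burst of queries on one topic cannot flush cached results for the others, which is how the design exploits per-topic temporal locality.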
CS297 Results
- Implemented a CacheRefresh Media Job to periodically populate the cache.
- Implemented a dynamic cache with a machine-learning-based admission and eviction policy.
The algorithm achieved a hit rate of 28.33%, while its counterpart LRU achieved 52.60%, for
a cache size of 100 and a query log of 4,200 queries.
- Implemented a Static-Topic-Dynamic cache using LDA features. For a cache size of 100 and a query log
of 10,000 queries, the hit rate was 43.16%, slightly higher than LRU (42.33%) and slightly lower than SDC (43.35%).
- Implemented a Static-Semi-Static-Dynamic cache. For a cache size of 300 and a query log of 1.5 million queries,
the hit rate was 55.76%, slightly lower than SDC's 56.65%.
Proposed Schedule
Week 1:
24 Jan - 31 Jan | Plan and decide on deliverables |
Week 2:
31 Jan - 07 Feb | Create proposal |
Week 3:
07 Feb - 14 Feb | Create test dataset and cache evaluation module |
Week 4:
14 Feb - 21 Feb | Create test dataset and cache evaluation module |
Week 5:
21 Feb - 28 Feb | Get performance reading of current caching algorithm |
Week 6:
28 Feb - 07 Mar | Implement and train LDA model using enriched search queries |
Week 7:
07 Mar - 14 Mar | Implement and train LDA model using enriched search queries |
Week 8:
14 Mar - 21 Mar | Implement and train LDA model using enriched search queries |
Week 9:
21 Mar - 28 Mar | Implement Static-Topic-Dynamic Cache |
Week 10:
28 Mar - 04 Apr | Implement Static-Topic-Dynamic Cache |
Week 11:
04 Apr - 11 Apr | Fine tune cache with different configurations. Start working on report |
Week 12:
11 Apr - 18 Apr | Evaluate and conclude with performance results. Continue working on report |
Week 13:
18 Apr - 25 Apr | Complete report and get it reviewed by Prof. Pollett |
Week 14:
25 Apr - 02 May | Finish report and presentation and get it reviewed by the committee |
Week 15:
02 May - 09 May | Review and fix report and presentation |
Week 16:
09 May - 16 May | Practice presentation and demo |
Week 17:
16 May - 23 May | Wrap up all work |
Key Deliverables:
- Software
- Create a test dataset and a module to evaluate the performance of the cache in Yioop. Provide performance
results for the caching algorithm currently implemented in Yioop
- Implement a Latent Dirichlet Allocation (LDA) model and train it on search queries.
LDA is an unsupervised topic model and will be used to assign topics to queries.
- Implement the Static-Topic-Dynamic cache in Yioop. Compare its performance under different configurations
- Report
- CS 298 Report - a report fully describing the implementation of the STD caching algorithm, dataset preparation for LDA,
major design decisions, and performance evaluation against benchmark algorithms
- CS 298 Presentation - a presentation containing an overview of the complete project and the results of the implementations
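The first deliverable, the cache evaluation module, amounts to replaying a query log against a cache and counting hits, which is how hit rates like those reported for CS297 are obtained. The sketch below shows this harness in Python (the actual module will live in Yioop's PHP codebase); the `LRUCache` baseline and all names are illustrative assumptions, not Yioop's API.

```python
from collections import OrderedDict

class LRUCache:
    """Baseline LRU cache, included only so the harness is self-contained."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, query, topic=None):
        if query in self.store:
            self.store.move_to_end(query)   # refresh recency on a hit
            return self.store[query]
        return None

    def put(self, query, topic, results):
        self.store[query] = results
        self.store.move_to_end(query)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

def hit_rate(cache, query_log, assign_topic=lambda q: None):
    """Replay a query log against a cache and return the fraction of hits.

    `cache` needs get(query, topic) / put(query, topic, results);
    `assign_topic` maps a query to a topic label (e.g. via the LDA model)."""
    hits = 0
    for query in query_log:
        topic = assign_topic(query)
        if cache.get(query, topic) is not None:
            hits += 1
        else:
            cache.put(query, topic, "results for " + query)
    return hits / len(query_log)
```

Because the harness only assumes a get/put interface, the same query log can be replayed against LRU, SDC, and STD instances to produce directly comparable hit rates.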
Innovations and Challenges
- To the best of our knowledge, there is no publicly known open-source web search engine that utilizes a Static-Topic-Dynamic cache.
The paper proposing it has been cited by three other papers, none of which mentions an implementation.
- The major challenge of the project is training the LDA model. Since search queries are usually short, they
need to be enriched with summary texts from search results before they can be effectively categorized into topics.
- The existing cache design is tightly coupled to a single algorithm. The new design will allow switching between different
caching algorithms. The STD cache itself uses multiple cache instances and a topic model. With the new design,
the different components of the cache will be loosely coupled to provide flexibility.
- The STD cache partitions have to be fine-tuned to use cache space efficiently. Performance results for
different configurations will be evaluated and compared against different benchmark algorithms.
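The enrichment step behind the LDA challenge above can be sketched simply: each short query is expanded into a longer pseudo-document by concatenating it with the summaries of its top search results before topic modeling. The function and field names below are illustrative assumptions, not Yioop's.

```python
def enrich_query(query, search_results, max_snippets=3):
    """Expand a short query with summary snippets from its top results.

    LDA performs poorly on very short documents, so the document fed to
    the topic model is the query text plus the summaries of up to
    `max_snippets` of its top-ranked results. Illustrative names only."""
    snippets = [result["summary"] for result in search_results[:max_snippets]]
    return " ".join([query] + snippets)
```

The enriched strings, rather than the raw queries, would then form the training corpus for the LDA model, giving it enough co-occurring terms to separate topics such as weather and education.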
References:
[1] H. Ma, O. Tao, C. Zhao, P. Li, and L. Wang, "Impact of replacement policies on static-dynamic query results cache in web search engines," in 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), 2017, pp. 137-139. doi: 10.1109/ISI.2017.8004890.
[2] R. Solar, V. Gil-Costa, and M. Marin, "Evaluation of Static/Dynamic Cache for Similarity Search Engines," in SOFSEM 2016: Theory and Practice of Computer Science, 2016, pp. 615-627.
[3] R. Blanco, E. Bortnikov, F. Junqueira, R. Lempel, L. Telloli, and H. Zaragoza, "Caching Search Engine Results over Incremental Indices," in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010, pp. 82-89. doi: 10.1145/1835449.1835466.
[4] T. Trinh, D. Wu, and J. Z. Huang, "C3C: A New Static Content-Based Three-Level Web Cache," IEEE Access, vol. 7, pp. 11796-11808, 2019, doi: 10.1109/ACCESS.2019.2892761.
[5] T. Kucukyilmaz, B. B. Cambazoglu, C. Aykanat, and R. Baeza-Yates, "A machine learning approach for result caching in web search engines," Information Processing & Management, vol. 53, no. 4, pp. 834-850, 2017, doi: 10.1016/j.ipm.2017.02.006.
[6] I. Mele, N. Tonellotto, O. Frieder, and R. Perego, "Topical result caching in web search engines," Information Processing & Management, vol. 57, no. 3, p. 102193, 2020, doi: 10.1016/j.ipm.2019.102193.
[7] M. Catena and N. Tonellotto, "Energy-Efficient Query Processing in Web Search Engines," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 7, pp. 1412-1425, 2017, doi: 10.1109/TKDE.2017.2681279.
[8] B. Cambazoglu and R. Baeza-Yates, "Scalability Challenges in Web Search Engines," in Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 7, 2011, pp. 27-50. doi: 10.1007/978-3-642-20946-8_2.
[9] R. Ozcan, I. S. Altingovde, and A. Ulusoy, "Cost-Aware Strategies for Query Result Caching in Web Search Engines," ACM Trans. Web, vol. 5, no. 2, May 2011, doi: 10.1145/1961659.1961663.