Chris Pollett
Deliverable 2
Implement Machine Learning Dynamic Cache
Purpose
The goal of this deliverable is to implement a cache eviction policy based
on the framework described in the paper "A machine learning approach for result caching in web search engines".
Description
The paper proposes a machine learning based cache admission and eviction algorithm. In this framework, a set of features is derived from query logs, and the admission and eviction decisions are made by a machine learning model trained on these features. The paper proposes models for both the static and the dynamic cache: populating the static cache is modeled as an offline cache allocation problem, while the dynamic cache is modeled as an online eviction problem.

To derive features, the AOL query logs were used. The subset of features was selected based on the availability of query logs and Yioop indexing data structures. IAT_NEXT is the ground-truth value that represents the next time the same query appears, i.e., the number of queries until the next occurrence of that query. The goal of the machine learning model was to predict a value as close as possible to the actual IAT_NEXT; when the cache was full, the query with the highest predicted IAT_NEXT was evicted.

To create the features, queries were first cleaned (Code: Data Cleaning): each query was converted to lowercase, stop words were removed, and the remaining words were stemmed. Features were then extracted from the cleaned queries. The key observation was that the data is highly skewed: more than 80% of the queries had an IAT_NEXT value smaller than 1% of the total number of queries (Figure 1). To mitigate this, the IAT_NEXT values were binned using logarithmic binning, which reduced the effect of the skew (Figure 2). A regression model was then fitted and evaluated on this data. (Code: Machine Learning Dynamic Cache Training Zip)
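The cleaning, binning, and eviction steps described above can be sketched as follows. This is a simplified illustration, not the deliverable's actual code: the stop-word list and suffix-stripping stemmer are minimal stand-ins (a real pipeline would use a proper stemmer such as Porter's), and the function names are hypothetical.

```python
import math
import re

# Hypothetical, truncated stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "for", "to"}

def simple_stem(word):
    # Naive suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_query(query):
    # Lowercase, keep alphanumeric tokens, drop stop words, stem the rest.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(simple_stem(t) for t in tokens if t not in STOP_WORDS)

def log_bin(iat_next):
    # Logarithmic binning compresses the long tail of IAT_NEXT values,
    # reducing the skew before fitting the regression model.
    return int(math.log2(iat_next)) if iat_next > 0 else 0

def evict_victim(cache, predicted_iat):
    # When the cache is full, evict the cached query whose predicted
    # IAT_NEXT (queries until its next appearance) is largest.
    return max(cache, key=lambda q: predicted_iat[q])
```

For example, `clean_query("The Searching Engines")` yields `"search engin"`, and `log_bin` maps IAT_NEXT values 1-2 to bin 0-1, 8 to bin 3, and so on, so distant bins cover exponentially wider ranges.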
Results
The model achieved a hit rate of 28.33% for a cache size of 100 over 4,200 queries,
while its counterpart LRU achieved a hit rate of 52.60% and the optimal offline
algorithm, Belady's algorithm, achieved a 57.27% hit rate.
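For reference, a hit rate like the LRU figure above is measured by replaying the query stream against a fixed-size cache and counting hits. A minimal LRU baseline (an assumed sketch, not the evaluation harness actually used) can be written as:

```python
from collections import OrderedDict

def lru_hit_rate(queries, cache_size):
    # Serve a query stream with an LRU cache of fixed size and
    # return the fraction of queries answered from the cache.
    cache = OrderedDict()
    hits = 0
    for q in queries:
        if q in cache:
            hits += 1
            cache.move_to_end(q)           # refresh recency on a hit
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict least recently used
            cache[q] = True
    return hits / len(queries)
```

On the stream `["a", "b", "a", "c", "a", "b"]` with a cache of size 2, this yields 2 hits out of 6 queries. The same loop, with the eviction line swapped for the model's highest-predicted-IAT_NEXT choice, gives the ML policy's hit rate.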
Conclusion
As the hit rate is significantly lower than both baselines, we conclude that this
algorithm is not suitable for our implementation. Although its performance could be
improved by adding more features, doing so would require considerable effort in
fine-tuning and feature engineering. Hence, this algorithm will not be considered
for further development.
Figure 1: Distribution of IAT_NEXT values
Figure 2: Distribution of IAT_NEXT values after log binning