Chris Pollett > Students >
Padia

    ( Print View)

    [Bio]

    [Blog]

    [CS297 Proposal]

    [CS 297 Report PDF]

    [CS298 Proposal]

    [CS298 Report PDF]

    [CS298 Oral Defence Slides PDF]

    [Deliverable 1: CacheRefresh MediaJob]

    [Deliverable 2: Implement MLDC Algorithm]

    [Deliverable 3: Implement STDC Algorithm]

    [Deliverable 4: Implement SSDC Algorithm]

    [Understanding Yioop PDF]

    [Scalability Challenges PDF]

    [Cache Aware strategies PDF]

    [ML Based Cache Algorithm PDF]

    [Static Topic Dynamic Cache PDF]

    [Static Semi-Static Dynamic Cache PDF]

    [Query Statistics]

Deliverable 3
Implement Static Topic Dynamic Cache

Purpose
The goal of this deliverable is to implement Static-Topic-Dynamic Cache (STDC) "Topical result caching in web search engines"

Description

The given approach refines the traditional Static-Dynamic Cache (SDC) to capture the temporal locality of the topic in query streams. In SDC, static cache is populated periodically with the popular queries with assumption being popular queries in past remains popular in future. The dynamic part of the cache using LRU strategy captures the burst of short-term popular queries. The drawback of this approach is that it fails to capture queries with long-term popularity. There are certain queries which are neither too popular to be cached in static cache nor too frequent in short-term to be captured by dynamic cache. For instance, queries for certain topics like weather can be popular in the morning and late in the evening but are not popular in other parts of the day. To capture this temporal locality of the topic, topic cache is introduced along with the static and the dynamic part.

Assigning topic to a query is a challenging part as queries are generally very short. To classify queries correctly, queries are enriched with the search results. For this deliverable, queries are not enriched with search results, instead a news dataset is used to train the Latent Dirichlet Allocation (LDA) model. LDA model is an unsupervised statistical method to discover the hidden topic in the text. Given, k different topic, it assigns individual scores between [0, 1] to each topic. The topic is assigned to the query if the topic has maximum score and is greater than minimum threshold.

The topical cache has individual instances of cache of each topic. These cache instances can be LRU or SDC. The size of each instance can be configured to be fixed or variable depending upon the popularity of the topic. For this deliverable, we have evaluated STDC with both fixed sized (STDC) and variable size (STDC) topic cache and compared its performance with traditional SDC, LRU and optimal offline algorithm for 10K queries. (Code: Latent Dirichlet Allocation, Cache Simulator Zip)


Results
Result of STDC algorithm
Cache SizeConfigurationHit Rate
100LRU42.33%
100Belady49.08%
100SDC(30-70)43.35%
100STDC(30-50-20)43.16%
100STDC-V(30-50-20)43.10%
200LRU44.49%
200Belady49.81%
200SDC(15-85)45.16%
200SDTC(15-50-35)45.08%
200STDC-V(15-50-35)45.01%

Conclusion
The performance of this algorithm is significant improvement over the LRU baseline algorithm. The algorithm improved the hit rate by 2% resulting in gap reduction of 12% with optimal Belady. Also, compared to SDC, the performance is not significantly lower and can improved the better data set and fine tuning the number of topic. Thus, we will further explore this algorithm in the next semester