CS298 Proposal
Adaptive Clustering in Search Engines
Kuldeep Dhole (kuldeep.dhole@sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Sami Khuri, Dr. Robert Chun
Abstract:
In the Spring 2015 semester, Hierarchical Clustering model was tried on Yioop search engine but there were two major problems with this approach:
(1) Creating hierarchy on 40K index_shard documents was extremely slow because of O(n^3) (2)- Merging of hierarchies was not practical because the viable way was to throw the clusters to be merged in a single bucket, and then again create the new hierarchy, which has the tendency to become gradually slower, as no of hierarchies increase.
In Fall 2015, I am going to try K-means based approach.
It has two levels:
-
Level 1: K-means Implementation
Implementing the K-means algorithm, which create K clusters on the given set of documents. In Yioop, we are going to apply K-means on every index_shard, which approximately contains 40K documents. This implementation should be faster in a way that it should not delay Yioops next index_shard processing.
-
Level 2: Merging of clusters
Level1 runs on 200 consecutive index shards, giving us the 200 different K-means clusters.
In level 2, a global K-means runs on the 200 K-means clusters as input. In this level, the actual K-means center vectors of each cluster will be considered as a document.
So, as each Level 1 K-means cluster has 200 clusters i.e. 200 center vectors, Level 2 K-means will have input as 200 * 200 = 40000 vectors set as document set, which eventually will give the global 200 clusters.
CS297 Results
- Implemented a Naive Bayes Classifier to do Email Spam Classification
- Implemented Hierarchical Agglomerative Clustering to form clusters on text documents
- Studied Classification and Clustering working in Yioop
- Made a report on Recipe Plugin working and how to scale out
Proposed Schedule Fall 2015
Week 1:
Sep 14 - Sep 20 | Deliverable#1 : An implementation K-means algorithm for the Yioop search engine over one index_shard (~35k documents set)
|
Week 2,3:
Sep 21 - Oct 3 | Merging of K-means algorithm a.k.a. adaptive K-means design |
Week 4,5,6,7:
Oct 4 - Oct 30 | Deliverable#2 : Implementation of Level-2 K-means algorithm, which runs on Yioop's index_shards |
Week 8:
Nov 1 - Nov 7 | Deliverable #3: Running Level-2 in distributed environment |
Week 9:
Nov 9 - Nov 15 | Deliverable#4: Displaying results on UI |
Week 10:
Nov 16 - Nov 22 | Improvements/Modifications |
Week 11,12:
Nov 23 - Dec 5 | Report/Presentation |
Key Deliverables:
- Software
- Hierarchical algorithm implementation and its downside
- K-means implementation that runs on single index_shard of ~35k documents (independent i.e. not part of Yioop)
- An advanced K-means implementation as part of Yioop that scales to millions of documents, which works in two levels as follows:
-
Level 1- K-means on one index shard of nearly 40k documents set, it keeps creating K-means for consecutive 200 index shards. Eventually, we will have 200 K-means clusters
-
Level 2- A global K-means on 200 K-means clusters as input i.e. on 200 * 40k = 8 million documents set
- Displaying the clustered results on GUI i.e. visualizing the results
- Report
- CS298 Project Report
- Code Documentation
Innovations and Challenges
- To implement an advanced K-means that works on two levels.
- Visualizing the clusters.
References:
HIERARCHICAL CLUSTERING. Costa Siourbas. http://cgm.cs.mcgill.ca/~soss/cs644/projects/siourbas/, 1999.
Information Retrieval: Implementing and Evaluating Search Engines. Buettcher, Clarke and Cormack. The MIT Press. 2010.
Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig. Prentice Hall, 1st edition . 1995. |