CS298 Proposal
Adaptive Clustering in Search Engines
Kuldeep Dhole (kuldeep.dhole@sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Sami Khuri, Dr. Robert Chun
Abstract:
In the Spring 2015 semester, Hierarchical Clustering model was tried on Yioop search engine but there were two major problems with this approach:
(1) Creating hierarchy on 40K index_shard documents was extremely slow because of O(n^3) (2) Merging of hierarchies was not practical because the viable way was to throw the clusters to be merged in a single bucket, and then again create the new hierarchy, which has the tendency to become gradually slower, as no of hierarchies increase.
In Fall 2015, I am going to try Kmeans based approach.
It has two levels:

Level 1: Kmeans Implementation
Implementing the Kmeans algorithm, which create K clusters on the given set of documents. In Yioop, we are going to apply Kmeans on every index_shard, which approximately contains 40K documents. This implementation should be faster in a way that it should not delay Yioops next index_shard processing.

Level 2: Merging of clusters
Level1 runs on 200 consecutive index shards, giving us the 200 different Kmeans clusters.
In level 2, a global Kmeans runs on the 200 Kmeans clusters as input. In this level, the actual Kmeans center vectors of each cluster will be considered as a document.
So, as each Level 1 Kmeans cluster has 200 clusters i.e. 200 center vectors, Level 2 Kmeans will have input as 200 * 200 = 40000 vectors set as document set, which eventually will give the global 200 clusters.
CS297 Results
 Implemented a Naive Bayes Classifier to do Email Spam Classification
 Implemented Hierarchical Agglomerative Clustering to form clusters on text documents
 Studied Classification and Clustering working in Yioop
 Made a report on Recipe Plugin working and how to scale out
Proposed Schedule Fall 2015
Week 1:
Sep 14  Sep 20  Deliverable#1 : An implementation Kmeans algorithm for the Yioop search engine over one index_shard (~35k documents set)

Week 2,3:
Sep 21  Oct 3  Merging of Kmeans algorithm a.k.a. adaptive Kmeans design 
Week 4,5,6,7:
Oct 4  Oct 30  Deliverable#2 : Implementation of Level2 Kmeans algorithm, which runs on Yioop's index_shards 
Week 8:
Nov 1  Nov 7  Deliverable #3: Running Level2 in distributed environment 
Week 9:
Nov 9  Nov 15  Deliverable#4: Displaying results on UI 
Week 10:
Nov 16  Nov 22  Improvements/Modifications 
Week 11,12:
Nov 23  Dec 5  Report/Presentation 
Key Deliverables:
 Software
 Hierarchical algorithm implementation and its downside
 Kmeans implementation that runs on single index_shard of ~35k documents (independent i.e. not part of Yioop)
 An advanced Kmeans implementation as part of Yioop that scales to millions of documents, which works in two levels as follows:

Level 1 Kmeans on one index shard of nearly 40k documents set, it keeps creating Kmeans for consecutive 200 index shards. Eventually, we will have 200 Kmeans clusters

Level 2 A global Kmeans on 200 Kmeans clusters as input i.e. on 200 * 40k = 8 million documents set
 Displaying the clustered results on GUI i.e. visualizing the results
 Report
 CS298 Project Report
 Code Documentation
Innovations and Challenges
 To implement an advanced Kmeans that works on two levels.
 Visualizing the clusters.
References:
HIERARCHICAL CLUSTERING. Costa Siourbas. http://cgm.cs.mcgill.ca/~soss/cs644/projects/siourbas/, 1999.
Information Retrieval: Implementing and Evaluating Search Engines. Buettcher, Clarke and Cormack. The MIT Press. 2010.
Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig. Prentice Hall, 1st edition . 1995.