CS298 Proposal

Adaptive Clustering in Search Engines

Kuldeep Dhole (kuldeep.dhole@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Sami Khuri, Dr. Robert Chun

Abstract:

In the Spring 2015 semester, Hierarchical Clustering model was tried on Yioop search engine but there were two major problems with this approach: (1) Creating hierarchy on 40K index_shard documents was extremely slow because of O(n^3) (2)- Merging of hierarchies was not practical because the viable way was to throw the clusters to be merged in a single bucket, and then again create the new hierarchy, which has the tendency to become gradually slower, as no of hierarchies increase.

In Fall 2015, I am going to try K-means based approach. It has two levels:

Level 1: K-means Implementation
Implementing the K-means algorithm, which create K clusters on the given set of documents. In Yioop, we are going to apply K-means on every index_shard, which approximately contains 40K documents. This implementation should be faster in a way that it should not delay Yioops next index_shard processing.
Level 2: Merging of clusters
Level1 runs on 200 consecutive index shards, giving us the 200 different K-means clusters.
In level 2, a global K-means runs on the 200 K-means clusters as input. In this level, the actual K-means center vectors of each cluster will be considered as a document.
So, as each Level 1 K-means cluster has 200 clusters i.e. 200 center vectors, Level 2 K-means will have input as 200 * 200 = 40000 vectors set as document set, which eventually will give the global 200 clusters.

CS297 Results

Implemented a Naive Bayes Classifier to do Email Spam Classification
Implemented Hierarchical Agglomerative Clustering to form clusters on text documents
Studied Classification and Clustering working in Yioop
Made a report on Recipe Plugin working and how to scale out

Proposed Schedule Fall 2015

Week 1: Sep 14 - Sep 20	Deliverable#1 : An implementation K-means algorithm for the Yioop search engine over one index_shard (~35k documents set)
Week 2,3: Sep 21 - Oct 3	Merging of K-means algorithm a.k.a. adaptive K-means design
Week 4,5,6,7: Oct 4 - Oct 30	Deliverable#2 : Implementation of Level-2 K-means algorithm, which runs on Yioop's index_shards
Week 8: Nov 1 - Nov 7	Deliverable #3: Running Level-2 in distributed environment
Week 9: Nov 9 - Nov 15	Deliverable#4: Displaying results on UI
Week 10: Nov 16 - Nov 22	Improvements/Modifications
Week 11,12: Nov 23 - Dec 5	Report/Presentation

Key Deliverables:

Software
- Hierarchical algorithm implementation and its downside
- K-means implementation that runs on single index_shard of ~35k documents (independent i.e. not part of Yioop)
- An advanced K-means implementation as part of Yioop that scales to millions of documents, which works in two levels as follows:
  - Level 1- K-means on one index shard of nearly 40k documents set, it keeps creating K-means for consecutive 200 index shards. Eventually, we will have 200 K-means clusters
  - Level 2- A global K-means on 200 K-means clusters as input i.e. on 200 * 40k = 8 million documents set
- Displaying the clustered results on GUI i.e. visualizing the results
- Report
  - CS298 Project Report
  - Code Documentation
Innovations and Challenges
- To implement an advanced K-means that works on two levels.
- Visualizing the clusters.
References:

HIERARCHICAL CLUSTERING. Costa Siourbas. http://cgm.cs.mcgill.ca/~soss/cs644/projects/siourbas/, 1999.

Information Retrieval: Implementing and Evaluating Search Engines. Buettcher, Clarke and Cormack. The MIT Press. 2010.

Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig. Prentice Hall, 1st edition . 1995.