CS298 Proposal

Adaptive Clustering in Search Engines

Kuldeep Dhole (kuldeep.dhole@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Sami Khuri, Dr. Robert Chun

Abstract:

In the Spring 2015 semester, Hierarchical Clustering model was tried on Yioop search engine but there were two major problems with this approach: (1) Creating hierarchy on 40K index_shard documents was extremely slow because of O(n^3) (2)- Merging of hierarchies was not practical because the viable way was to throw the clusters to be merged in a single bucket, and then again create the new hierarchy, which has the tendency to become gradually slower, as no of hierarchies increase.

In Fall 2015, I am going to try K-means based approach. It has two levels:

  • Level 1: K-means Implementation
    Implementing the K-means algorithm, which create K clusters on the given set of documents. In Yioop, we are going to apply K-means on every index_shard, which approximately contains 40K documents. This implementation should be faster in a way that it should not delay Yioops next index_shard processing.
  • Level 2: Merging of clusters
    Level1 runs on 200 consecutive index shards, giving us the 200 different K-means clusters.
    In level 2, a global K-means runs on the 200 K-means clusters as input. In this level, the actual K-means center vectors of each cluster will be considered as a document.
    So, as each Level 1 K-means cluster has 200 clusters i.e. 200 center vectors, Level 2 K-means will have input as 200 * 200 = 40000 vectors set as document set, which eventually will give the global 200 clusters.

  • CS297 Results

    • Implemented a Naive Bayes Classifier to do Email Spam Classification
    • Implemented Hierarchical Agglomerative Clustering to form clusters on text documents
    • Studied Classification and Clustering working in Yioop
    • Made a report on Recipe Plugin working and how to scale out

    Proposed Schedule Fall 2015

    Week 1: Sep 14 - Sep 20Deliverable#1 : An implementation K-means algorithm for the Yioop search engine over one index_shard (~35k documents set)
    Week 2,3: Sep 21 - Oct 3Merging of K-means algorithm a.k.a. adaptive K-means design
    Week 4,5,6,7: Oct 4 - Oct 30Deliverable#2 : Implementation of Level-2 K-means algorithm, which runs on Yioop's index_shards
    Week 8: Nov 1 - Nov 7Deliverable #3: Running Level-2 in distributed environment
    Week 9: Nov 9 - Nov 15Deliverable#4: Displaying results on UI
    Week 10: Nov 16 - Nov 22Improvements/Modifications
    Week 11,12: Nov 23 - Dec 5Report/Presentation

    Key Deliverables:

    • Software
      • Hierarchical algorithm implementation and its downside
      • K-means implementation that runs on single index_shard of ~35k documents (independent i.e. not part of Yioop)
      • An advanced K-means implementation as part of Yioop that scales to millions of documents, which works in two levels as follows:
        • Level 1- K-means on one index shard of nearly 40k documents set, it keeps creating K-means for consecutive 200 index shards. Eventually, we will have 200 K-means clusters
        • Level 2- A global K-means on 200 K-means clusters as input i.e. on 200 * 40k = 8 million documents set
      • Displaying the clustered results on GUI i.e. visualizing the results
      • Report
        • CS298 Project Report
        • Code Documentation

      Innovations and Challenges

      • To implement an advanced K-means that works on two levels.
      • Visualizing the clusters.

      References:

      HIERARCHICAL CLUSTERING. Costa Siourbas. http://cgm.cs.mcgill.ca/~soss/cs644/projects/siourbas/, 1999.

      Information Retrieval: Implementing and Evaluating Search Engines. Buettcher, Clarke and Cormack. The MIT Press. 2010.

      Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig. Prentice Hall, 1st edition . 1995.