CS298 Proposal

Adaptive Clustering in Search Engines

Kuldeep Dhole (kuldeep.dhole@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Sami Khuri, Dr. Robert Chun

Abstract:

A search engine application can present search result pages in categories that reflect users' interests. Result pages can be grouped according to the categories predicted by a query classification algorithm. Yioop, an open source search engine developed by Dr. Pollett, currently displays results in its GUI without organizing them into clusters. We will use a hierarchical clustering algorithm to deliver organized search results. A web crawler continuously crawls web pages and builds indexes; at the same time, a clustering algorithm must continuously absorb the newly indexed data and keep updating the overall clustering. Because crawling and indexing run in a distributed environment, the clustering algorithm must be incremental and able to scale out. We aim to add an adaptive clustering technique, based on unsupervised learning, to the crawler so that clustering can scale out and search results are delivered according to the most relevant clusters.
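
To make the idea of hierarchical clustering concrete, the following is a minimal sketch of average-link agglomerative clustering over bag-of-words vectors. It is an illustration only, not Yioop code: the function names, the cosine similarity measure, and the 0.2 stopping threshold are all assumptions chosen for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def agglomerate(docs, threshold=0.2):
    """Average-link agglomerative clustering (threshold is an assumed value).

    Starts with one cluster per document and repeatedly merges the two most
    similar clusters until no pair is at least `threshold` similar.
    Returns a sorted list of clusters, each a sorted list of document indexes.
    """
    vectors = [vectorize(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def similarity(c1, c2):
        # Average pairwise similarity between members of the two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


docs = [
    "apple pie recipe with cinnamon",
    "easy apple pie recipe",
    "python web crawler tutorial",
    "writing a web crawler in python",
]
clusters = agglomerate(docs)
print(clusters)
```

On these four sample documents the two recipe pages end up in one cluster and the two crawler pages in another. This naive version recomputes all pairwise similarities on every pass, which is one reason an incremental approach is needed at search engine scale.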

CS297 Results

  • Implemented a Naive Bayes classifier for email spam classification
  • Implemented Hierarchical Agglomerative Clustering to form clusters of text documents
  • Studied how classification and clustering work in Yioop
  • Wrote a report on how the Recipe plugin works and how to scale it out
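
As an illustration of the first item, a Naive Bayes spam classifier can be sketched as below. This is a minimal multinomial variant with add-one smoothing written for this page, not the CS297 deliverable itself; the class name, the whitespace tokenization, and the tiny training set are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Score is log P(label) plus the sum of log P(word | label).
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, for illustration only.
model = NaiveBayes().fit(
    ["win free money now", "free prize claim now",
     "meeting agenda for monday", "project report due monday"],
    ["spam", "spam", "ham", "ham"],
)
print(model.predict("claim your free money"))
print(model.predict("monday project meeting"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.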

Proposed Schedule

Week 1,2 (Jan. 28-Feb. 11): An implementation of clustering for the Yioop search engine, via a logarithmic merging algorithm, that works in a single Yioop machine setting.
Week 3 (Feb. 12-Feb. 18): Deliverable #1: Clustering in Yioop using logarithmic merging in a single-machine setting.
Week 4 (Feb. 19-Feb. 25): Study how search results are displayed in the Yioop GUI, and implement the display of clustered search results in the GUI.
Week 5 (Feb. 26-Mar. 02): Deliverable #2: Delivering clustered search results in the Yioop GUI in a single-machine setting.
Week 6,7 (Mar. 03-16): Implementation of a distributed and parallel version of clustering in Yioop.
Week 8 (Mar. 17-23): Deliverable #3: Clustering using logarithmic merging in a multiple-machine setting.
Week 9,10 (Mar. 24-Apr. 06): Deliverable #4: Delivering clustered search results in the Yioop GUI in a multiple-machine setting.
Week 11 (Apr. 09-15): Work on the CS298 report.
Week 12 (Apr. 16-22): Create a first draft of the CS298 report.
Week 13,14 (Apr. 23-May 06): Create the final CS298 report and submit it to the advisor and committee members.
Week 15 (May 07-13): Defense.

Key Deliverables:

  • Software
    • Clustering using logarithmic merging on a single machine
    • Delivering clustered search results on Yioop on a single machine
    • Clustering using logarithmic merging on multiple machines
    • Delivering clustered search results on Yioop on multiple machines
  • Report
    • CS298 Project Report
    • Code Documentation
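
For the "delivering clustered search results" deliverables, one simple way to picture the output is a ranked result list regrouped by cluster while keeping the original rank order inside each group. The sketch below assumes a hypothetical mapping from URL to cluster label (`cluster_of`) and a hypothetical "unclustered" fallback label; neither is part of Yioop's actual interface.

```python
def group_by_cluster(ranked_results, cluster_of):
    """Group ranked URLs by cluster label, preserving rank order.

    Clusters appear in the order of their best-ranked member, and URLs
    with no known cluster fall into a hypothetical "unclustered" group.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for url in ranked_results:
        label = cluster_of.get(url, "unclustered")
        groups.setdefault(label, []).append(url)
    return list(groups.items())


# Made-up URLs and labels, for illustration only.
ranked = ["a.example/pie", "b.example/crawler", "c.example/tart", "d.example/misc"]
labels = {
    "a.example/pie": "recipes",
    "b.example/crawler": "programming",
    "c.example/tart": "recipes",
}
grouped = group_by_cluster(ranked, labels)
for label, urls in grouped:
    print(label, urls)
```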

Innovations and Challenges

  • Creating clusters on the fly for newly indexed data is new in a search engine system, and implementing it in a distributed and parallel way is a challenge
  • Implementing incremental cluster merging is a challenge
  • Delivering search results in clustered form in the GUI is a challenge
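
The incremental merging mentioned above can be pictured with the logarithmic merging scheme used for index partitions: each new batch starts at generation 0, and whenever two partitions of the same generation exist they are merged into one partition of the next generation. The sketch below is a minimal illustration under assumptions: the class name is made up, partitions are plain lists, and the `merge` callback is a hypothetical stand-in for whatever combines two sets of clusters in practice.

```python
class LogarithmicMerger:
    """Keeps at most one partition per generation, like a binary counter."""

    def __init__(self, merge):
        self.merge = merge      # hypothetical: combines two partitions into one
        self.partitions = {}    # generation number -> partition

    def add(self, partition):
        generation = 0
        # A collision at generation g triggers a merge into generation g + 1,
        # which may in turn collide and cascade further.
        while generation in self.partitions:
            partition = self.merge(self.partitions.pop(generation), partition)
            generation += 1
        self.partitions[generation] = partition


# Stand-in merge: concatenate two batches of document ids.
merger = LogarithmicMerger(merge=lambda old, new: old + new)
for doc_id in range(5):
    merger.add([doc_id])
print(merger.partitions)
```

After five single-document batches, the merger holds one generation-2 partition with documents 0 to 3 and one generation-0 partition with document 4. With n batches there are at most about log2(n) + 1 live partitions, so each document takes part in only a logarithmic number of merges.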

References:

Hierarchical Clustering. Costa Siourbas. http://cgm.cs.mcgill.ca/~soss/cs644/projects/siourbas/. 1999.

Information Retrieval: Implementing and Evaluating Search Engines. Buettcher, Clarke and Cormack. The MIT Press. 2010.

Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig. Prentice Hall, 1st edition. 1995.