CS298 Proposal
Adaptive Clustering in Search Engines
Kuldeep Dhole (kuldeep.dhole@sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Sami Khuri, Dr. Robert Chun
Abstract:
A search engine can group its result pages into categories that reflect a user's interests, for example the categories predicted by a query classification algorithm. Yioop, an open source search engine developed by Dr. Pollett, currently displays results in its GUI without organizing them into clusters. We will use a hierarchical clustering algorithm to present search results in organized groups. While a web crawler continuously crawls pages and builds indexes, a clustering algorithm will simultaneously absorb the newly indexed data and keep updating the overall clustering. Because crawling and indexing run in a distributed environment, the clustering algorithm must be incremental and able to scale out. We aim to add an adaptive clustering technique to the crawler, based on unsupervised learning, so that clustering scales out and search results are delivered according to the most relevant clusters.
CS297 Results
- Implemented a Naive Bayes classifier for email spam classification
- Implemented Hierarchical Agglomerative Clustering to cluster text documents
- Studied how classification and clustering work in Yioop
- Wrote a report on how the Recipe Plugin works and how to scale it out
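For readers unfamiliar with the technique, the hierarchical agglomerative clustering listed above can be sketched roughly as follows. This is a minimal illustration and not the CS297 implementation: the bag-of-words vectors, the cosine similarity measure, the single-link merge criterion, and the function names `vectorize`, `cosine`, and `hac` are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hac(docs, num_clusters):
    """Single-link agglomerative clustering; returns clusters as lists of document indices."""
    vectors = [vectorize(d) for d in docs]
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > max(num_clusters, 1):
        best = None
        # Find the pair of clusters whose closest members are most similar.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(vectors[a], vectors[b])
                          for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        # Merge the chosen pair; j > i, so deleting j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling `hac(["apple banana fruit", "banana fruit salad", "python code bug", "code bug fix"], 2)` groups the two fruit documents together and the two code documents together. The pairwise search makes this sketch quadratic or worse in the number of documents, which is one reason an incremental approach is of interest for a crawler.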
Proposed Schedule
Week 1, 2 | Jan. 28 - Feb. 11 | Implement clustering for the Yioop search engine using a logarithmic merging algorithm that works in a single-machine Yioop setting.
Week 3 | Feb. 12 - Feb. 18 | Deliverable #1: Clustering in Yioop using logarithmic merging in a single-machine setting.
Week 4 | Feb. 19 - Feb. 25 | Study how search results are displayed on the Yioop GUI, and implement the display of clustered search results on the GUI.
Week 5 | Feb. 26 - Mar. 02 | Deliverable #2: Delivering clustered search results on the Yioop GUI in a single-machine setting.
Week 6, 7 | Mar. 03 - Mar. 16 | Implement a distributed and parallel version of clustering in Yioop.
Week 8 | Mar. 17 - Mar. 23 | Deliverable #3: Clustering using logarithmic merging in a multiple-machine setting.
Week 9, 10 | Mar. 24 - Apr. 06 | Deliverable #4: Delivering clustered search results on the Yioop GUI in a multiple-machine setting.
Week 11 | Apr. 09 - Apr. 15 | Work on the CS298 report.
Week 12 | Apr. 16 - Apr. 22 | Create a first draft of the CS298 report.
Week 13, 14 | Apr. 23 - May 06 | Create the final CS298 report and submit it to the advisor and committee members.
Week 15 | May 07 - May 13 | Defense.
Key Deliverables:
- Software
  - Clustering using logarithmic merging on a single machine
  - Delivering clustered search results on Yioop on a single machine
  - Clustering using logarithmic merging on multiple machines
  - Delivering clustered search results on Yioop on multiple machines
- Report
  - CS298 Project Report
  - Code Documentation
Innovations and Challenges
- In a search engine, creating clusters on the fly for newly indexed data is novel, and implementing this in a distributed and parallel way is a challenge
- Implementing incremental cluster merging is a challenge
- Delivering search results in clustered form on the GUI is a challenge
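To make the incremental merging idea concrete, here is a rough sketch of the logarithmic merging scheme as it is commonly described for index construction, applied to generic "partitions". How the project will actually apply it to clusters is not specified in this proposal, so this is only one possible reading: the class name `LogarithmicMerger`, the `buffer_size` parameter, and the caller-supplied `merge` function are hypothetical.

```python
class LogarithmicMerger:
    """Keeps at most one partition per generation.

    A partition in generation g covers buffer_size * 2**g items, so after
    n items only O(log n) partitions exist and each item takes part in
    O(log n) merges.
    """

    def __init__(self, buffer_size, merge):
        self.buffer_size = buffer_size
        self.merge = merge        # hypothetical: combines two partitions into one
        self.buffer = []          # newly indexed items not yet in any partition
        self.generations = {}     # generation number -> partition

    def add(self, item):
        """Absorb one new item, cascading merges when the buffer fills up."""
        self.buffer.append(item)
        if len(self.buffer) < self.buffer_size:
            return
        partition, self.buffer = self.buffer, []
        g = 0
        # Like carrying in binary addition: merge upward while a slot is occupied.
        while g in self.generations:
            partition = self.merge(self.generations.pop(g), partition)
            g += 1
        self.generations[g] = partition
```

In a clustering setting, a partition could be a set of clusters over a batch of documents, and `merge` could re-cluster the union of two such sets; with plain lists, `LogarithmicMerger(2, lambda a, b: a + b)` shows the cascading behaviour. Because each generation is independent until it is merged, different generations could in principle live on different machines, although that design is an assumption here rather than part of the proposal.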
References:
Hierarchical Clustering. Costa Siourbas. http://cgm.cs.mcgill.ca/~soss/cs644/projects/siourbas/. 1999.
Information Retrieval: Implementing and Evaluating Search Engines. Buettcher, Clarke and Cormack. The MIT Press. 2010.
Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig. Prentice Hall, 1st edition. 1995.