CS297 Proposal

Adaptive Clustering in Search Engines

Kuldeep Dhole (dkuldeep11@gmail.com)

Advisor: Dr. Chris Pollett

Description:

A search engine can present its result pages grouped into categories that reflect users' interests. These groups can follow the categories predicted by a query classification algorithm.

We will use unsupervised learning, in which the given data has no labels associated with it. A classifier attaches labels to pages or documents; in the unsupervised setting, a clustering technique produces the label for each document. A crawler uses these labels to determine what to crawl. A web crawler is a program that fetches web content and keeps an index of other sites' content up to date.
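As a rough illustration of how a clustering technique can group unlabeled documents, the following Python sketch performs single-linkage agglomerative clustering over token sets using Jaccard distance. The function names, the choice of distance, and the linkage rule are assumptions made for this example only; this is not Yioop code and not necessarily the approach the project will take.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerative_clusters(docs, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain.

    Hypothetical helper: documents are whitespace-tokenized, and the
    distance between clusters is the smallest pairwise Jaccard distance
    between their members (single linkage).
    """
    token_sets = [set(doc.lower().split()) for doc in docs]
    # Start with every document in its own cluster (lists of doc indices).
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        return min(
            jaccard_distance(token_sets[i], token_sets[j])
            for i in c1
            for j in c2
        )

    while len(clusters) > num_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # combinations() yields a < b, so deleting b leaves a in place.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    return [sorted(cluster) for cluster in clusters]
```

Called on a short list of document strings with num_clusters set to 2, the function returns two lists of document indices, one per cluster. A cluster could then be labeled, for instance, by its most frequent terms.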

We are using Yioop, an open source search engine developed by Dr. Pollett. Yioop comes with a crawler that has no control over classification and clustering. We aim to add an adaptive clustering technique to the crawler, based on unsupervised learning, so that it can scale out clustering.

Schedule:

Week 1: (Sep 1 - Sep 15) Understanding of unsupervised learning and the Bayes classifier
Week 2, 3: (Sep 16 - Oct 6) Deliverable #2: Build a Naive Bayes classifier for a given set of documents
Week 4: (Oct 7 - Oct 21) Understanding of hierarchical clustering
Week 5, 6: (Oct 22 - Nov 5) Deliverable #3: Build hierarchical agglomerative clustering for cluster analysis
Week 7: (Nov 5 - Nov 12) Understanding of Yioop
Week 8: (Nov 13 - Nov 20) Deliverable #4: Study of existing classifier and clustering code in Yioop
Week 9: (Nov 21 - Nov 28) Understanding of the recipe plug-in in Yioop
Week 10: (Nov 29 - Dec 5) Deliverable #5: Get the recipe plug-in to scale out clusters by ingredients
Week 11: (Dec 5 - Dec 12) Deliverable #6: CS297 report

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Build a Naive Bayes classifier for a given set of documents.

2. Build hierarchical agglomerative clustering for cluster analysis.

3. Study the existing classifier and clustering code in Yioop.

4. Get the recipe plug-in to scale out clusters by ingredients.

5. CS297 report.
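To make the first deliverable concrete, here is a minimal Python sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The class name, method names, and whitespace tokenization are hypothetical choices made for illustration; the actual deliverable may differ in language, features, and design.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self):
        self.class_doc_counts = Counter()        # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def train(self, labeled_docs):
        """labeled_docs is an iterable of (text, label) pairs."""
        for text, label in labeled_docs:
            self.class_doc_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior: fraction of training documents with this label.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen (word, label) pairs nonzero.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, the classifier is trained once on (text, label) pairs and then asked to classify new text. Working in log space avoids numeric underflow when many small probabilities are multiplied.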

References:

1. Hierarchical Clustering Algorithms: http://cgm.cs.mcgill.ca/~soss/cs644/projects/siourbas/sect5.html

2. Information Retrieval: Implementing and Evaluating Search Engines by Buettcher, Clarke and Cormack.

3. Yioop Documentation: http://www.seekquarry.com/?c=main&p=documentation

4. Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig: http://aima.cs.berkeley.edu/