Hierarchical Agglomerative Clustering
Proof of Concept on a Sample Documents Set

Aim
To understand and implement the Hierarchical Agglomerative Clustering algorithm in machine learning.

Source code: IPython Notebook on GitHub.

Introduction
In data mining, hierarchical clustering (also called hierarchical cluster analysis, or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters.
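Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: a bottom-up approach in which each observation starts in its own cluster and pairs of clusters are merged while moving up the hierarchy.
- Divisive: a top-down approach in which all observations start in one cluster and splits are performed recursively while moving down the hierarchy.

Description
My primary task was to understand the Hierarchical Agglomerative Clustering algorithm and apply it to a corpus of sample documents. The idea was to form clusters of similar documents and keep moving upward, that is, to keep merging the clusters already formed, until only one cluster remains.

Algorithm
Given n points p1, p2, …, pn, we start by treating each point as its own cluster c1, c2, …, cn and set num = n. The two most similar clusters are then merged repeatedly, reducing num by one each time, until a single cluster remains.

Vectorization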
To compare the documents on a common basis, I vectorized them using the Bag of Words model: each document becomes a vector of word counts over a shared vocabulary.
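The Bag of Words step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper names `build_vocabulary` and `vectorize` are hypothetical, tokenization is assumed to be a plain whitespace split, and the vocabulary is assumed to be sorted alphabetically.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return the alphabetically sorted list of distinct tokens in docs.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    return sorted({token for doc in docs for token in doc.lower().split()})


def vectorize(doc, vocabulary):
    """Map one document to a list of term counts, one entry per vocabulary word."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


# Sample strings chosen for illustration only.
docs = ["this is dog this is cat", "this is cat"]
vocabulary = build_vocabulary(docs)
vectors = [vectorize(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'is', 'this']
print(vectors)     # [[1, 1, 2, 2], [1, 0, 1, 1]]
```

Because every document is counted against the same vocabulary, all vectors have the same length and can be compared position by position.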
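For example, take doc1 = "this is cat" and doc2 = "this is hat". Over the alphabetically sorted vocabulary (cat, hat, is, this), doc1 becomes [1, 0, 1, 1] and doc2 becomes [0, 1, 1, 1].

Documents' Similarity Measure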
Once the documents are in vector form, their similarity is measured using cosine similarity, that is, the cosine of the angle between two vectors: (a · b) / (|a| |b|).

Input
corpus = ["this is dog this is cat", "this is cat", "that is different", "he is different", "they are happy"]

Output Dendrogram
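As a rough sketch of how the merge sequence behind such a dendrogram could be computed for the input corpus, the following combines Bag of Words vectors, cosine similarity, and repeated merging of the most similar pair of clusters. The linkage rule is an assumption: the text does not say which one was used, so average linkage (mean pairwise similarity) is chosen here for illustration. All function names are hypothetical.

```python
import math
from collections import Counter

corpus = [
    "this is dog this is cat",
    "this is cat",
    "that is different",
    "he is different",
    "they are happy",
]


def vectorize(docs):
    """Turn each document into a term-count vector over a shared sorted vocabulary."""
    vocabulary = sorted({token for doc in docs for token in doc.split()})
    return [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise cosine similarity between two clusters (assumed linkage rule)."""
    sims = [cosine_similarity(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerate(vectors):
    """Merge the two most similar clusters until one remains; return the merge list."""
    clusters = [(i,) for i in range(len(vectors))]  # every document starts alone
    merges = []
    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors))
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


merges = agglomerate(vectorize(corpus))
for left, right in merges:
    print(left, "+", right)
```

Under these assumptions, documents 0 and 1 merge first, then documents 2 and 3, then those two clusters, and document 4 ("they are happy") joins last because it shares no words with the others. Each entry in `merges` corresponds to one junction in a dendrogram.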
A Note On IPython
The code is most conveniently viewed and executed in an IPython Notebook.