A study of Classifiers and Clustering in Yioop

Aim

To understand and make a report on Classifiers and Clustering in Yioop

Introduction

Yioop does a number of things to improve the quality of its search results. Yioop Indexes can be used to create classifiers which then can be used in labeling and ranking future indexes. For example, you can add the classifier externally for specific label, like "finance" class and having meta words for it like bank, savings, checking, account, withdraw, credit, debit, etc.

Overview

classifier_tool.php is a command line tool for creating a classifier it can be used to perform some of the tasks that can also be done through the Web Classifier Interface. classifier_trainer.php is a daemon used in the finalization stage of building a classifier. The Classifiers activity provides a way to train classifiers that recognize classes of documents; these classifiers can then be used during a crawl to add appropriate meta words to pages determined to belong to one or more classes. Clicking on the Classifiers activity displays a text field where you can create a new classifier, and a table of existing classifiers, where each row corresponds to a classifier and provides some statistics and action links. A classifier is identified by its class label, which is also used to form the meta word that will be attached to documents.

Description

Primary there are following programs, which manage the whole classification in Yioop.

The primary class classifier.php controls the behavior of classification. An instance of this class represents a single classifier in memory, but the class also provides static methods to manage classifiers on disk.

A single classifier is a tool for determining the likelihood that a document is a positive instance of a particular class. In order to do this, a classifier goes through a training phase on a labeled training set where it learns weights for document features (terms, for our purposes). To classify a new document, the learned weights for all terms in the document are combined in order to yield a pdeudo-probability that the document belongs to the class.

A classifier is composed of a candidate buffer, a training set, a set of features, and a classification algorithm. In addition to the set of all features, there is a restricted set of features used for training and classification. There are also two classification algorithms: a Naive Bayes algorithm used during labeling, and a logistic regression algorithm used to train the final classifier. In general, a fresh classifier will first go through a labeling phase where a collection of labeled training documents is built up out of existing crawl indexes, and then a finalization phase where the logistic regression algorithm will be trained on the training set established in the first phase. After finalization, the classifier may be used to classify new web pages during a crawl.

During the labeling phase, the classifier fills a buffer of candidate pages from the user-selected index (optionally restricted by a query), and tries to pick the best one to present to the user to be labeled (here `best' means the one that, once labeled, is most likely to improve classification accuracy). Each labeled document is removed from the buffer, converted to a feature vector (described next), and added to the training set. The expanded training set is then used to train an intermediate Naive Bayes classification algorithm that is in turn used to more accurately identify good candidates for the next round of labeling. This phase continues until the user gets tired of labeling documents, or is happy with the estimated classification accuracy.

Instead of passing around terms everywhere, each document that goes into the training set is first mapped through a Features instance that maps terms to feature indices (e.g. "Pythagorean" => 1, "theorem" => 2, etc.). These feature indices are used internally by the classification algorithms, and by the algorithms that try to pick out the most informative features.

Once a sufficiently-useful training set has been built, a FeatureSelection instance is used to choose the most informative features, and copy these into a reduced Features instance that has a much smaller vocabulary, and thus a much smaller memory footprint. For efficiency, this is the Features instance used to train classification algorithms, and to classify web pages. Finalization is just the process of training a logistic regression classification algorithm on the full training set.

A short on command line tool execution of classifier, bin/classifier_tool.php, To build and evaluate a classifier for the label 'spam', trained using the two indexes "DATASET Neg" and "DATASET Pos", and a maximum of the top 25 most informative features: php bin/classifier_tool.php -a TrainAndTest -d 'DATASET' -l 'spam' -I cls.chi2.max=25