Description:

In a typical document classification task, the input to the machine learning algorithm (both during learning and classification) is text. From this, a bag of words (BOW) model can be constructed: the individual tokens are extracted and counted, and each distinct token in the training set defines a feature of each of the documents in both the training and test sets. Hashing function can be defined here as [from wikipedia] : function hashing_vectorizer(features : array of string, N : integer):
x := new vector[N]
for f in features:
h := hash(f)
x[h mod N] += 1
return x

Deliverables: