Naive Bayes Classifier For Email Spam Classification
Aim
To implement a Naive Bayes classifier from scratch to classify spam emails.
Introduction
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
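The independence assumption described above can be written compactly. The notation is introduced here for illustration and is not taken from the original project: $c$ is a class, $x_1, \dots, x_n$ are the observed features, and $\hat{c}$ is the predicted class.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

In the apple example, $c$ would be "apple" and the $x_i$ would be "red", "round", and "about 4 inches in diameter", each contributing its own factor $P(x_i \mid c)$.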
Description
My primary task was to understand the Naive Bayes classifier in machine learning and apply it to spam email classification.
Fork me on GitHub OR
Source Code - ZIP
Spam emails were pulled manually from a spam inbox and pre-classified into 3 classes: 1) I [Internet Advertising] 2) M [Medical Traps] 3) P [Phishing]
We build 2 datasets from the given set of emails: 1) Training Data 2) Test Data. We train the Naive Bayes classifier, implemented in Python, on the Training Data to build the classifier model.
This model is then applied to the Test Data, and we predict the class of every spam email in it.
Finally, we compute the accuracy by comparing the actual classes with the predicted classes.
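The accuracy step can be sketched as below. This is a minimal illustration, not the project's evaluateAccuracy function: the name `accuracy`, the two dictionaries keyed by document id, and the sample labels are all assumptions made for the example.

```python
def accuracy(actual, predicted):
    """Fraction of documents whose predicted class equals the actual class.

    Both arguments are hypothetical dicts mapping doc_id -> class label.
    """
    if not actual:
        raise ValueError("no test records to evaluate")
    correct = sum(
        1 for doc_id, label in actual.items()
        if predicted.get(doc_id) == label
    )
    return correct / len(actual)


# Made-up labels for four test emails (I, M, P are the classes named above).
actual = {1: "M", 2: "P", 3: "I", 4: "M"}
predicted = {1: "M", 2: "P", 3: "M", 4: "M"}
print(accuracy(actual, predicted))  # 3 of 4 match -> 0.75
```

A document that is missing from the predictions counts as a miss here, which is one possible design choice among several.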
Data Preprocessing and Data Format:
The raw data is a flat file; after processing, it is brought into the format shown below:
record number, words, class
e.g. 1, w1, w2, w3, M
2, w3, w4, w5, P
3, w5, w6, w7, I
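A record in this format could be parsed roughly as follows. This is only a sketch under the assumption that fields are comma-separated, the first field is the record number, and the last field is the class; `parse_record` is a hypothetical helper, not part of the original source.

```python
def parse_record(line):
    """Split 'record number, w1, w2, ..., class' into (doc_id, words, label)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 3:
        raise ValueError("expected at least: record number, one word, class")
    doc_id = int(fields[0])   # first field: record number
    words = fields[1:-1]      # middle fields: the words of the email
    label = fields[-1]        # last field: class (I, M or P)
    return doc_id, words, label


# The three example lines from the format description above.
records = [parse_record(line) for line in [
    "1, w1, w2, w3, M",
    "2, w3, w4, w5, P",
    "3, w5, w6, w7, I",
]]
print(records[0])  # (1, ['w1', 'w2', 'w3'], 'M')
```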
Functional Overview
NAME
naive_bayes_email_classifier
CLASSES
Tuple
class Tuple
| This is the tuple class, which holds the data in the format of
| doc_id,feature1,feature2,....,class
|
| Methods defined here:
|
| __init__(self)
|
| getClass(self)
|
| getFeatures(self)
|
| getId(self)
|
| setClass(self, cls)
|
| setFeatures(self, txt)
|
| setId(self, id)
|
| show(self)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| cnt = 1
FUNCTIONS
evaluateAccuracy(test_tuples, posterior)
This function calculates accuracy
filterClasses(training_tuples)
This generates unique classes that are possible in the given dataset
generateLikelihood(training_tuples, vocab, classes)
This function returns likelihood
generatePrior(training_tuples, classes)
This function returns prior
getVocab(training_tuples)
This function generates the unique words from the text in all tuples
loadData(training_data)
This function loads the data from a file into an array
main()
#------------MAIN------------
predict(test_tuples, classes, prior, likelihood)
This function calculates the maximum a posteriori class for every test record
and returns a hash of records as doc => prediction
showResults(training_data, test_data, posterior)
This function shows the training data, test data and predictions
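The prior, likelihood, and prediction steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the snake_case function names, the `(words, label)` pairs used in place of the Tuple class, the add-one (Laplace) smoothing, the use of log-probabilities, and the toy training emails are all choices made for this example.

```python
import math
from collections import Counter


def generate_prior(training, classes):
    """P(c): share of training documents labelled with class c."""
    counts = Counter(label for _, label in training)
    return {c: counts[c] / len(training) for c in classes}


def generate_likelihood(training, vocab, classes):
    """P(w | c) with add-one smoothing (smoothing is an assumption here)."""
    word_counts = {c: Counter() for c in classes}
    for words, label in training:
        word_counts[label].update(words)
    likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values()) + len(vocab)
        likelihood[c] = {w: (word_counts[c][w] + 1) / total for w in vocab}
    return likelihood


def predict(words, classes, prior, likelihood):
    """Return the class with the highest log-posterior score."""
    def score(c):
        s = math.log(prior[c])
        for w in words:
            if w in likelihood[c]:  # words never seen in training are skipped
                s += math.log(likelihood[c][w])
        return s
    return max(classes, key=score)


# Toy training set: (words, class) pairs, invented for illustration.
training = [
    (["cheap", "ads", "click"], "I"),
    (["pills", "cheap", "doctor"], "M"),
    (["bank", "password", "click"], "P"),
]
classes = sorted({label for _, label in training})
vocab = sorted({w for words, _ in training for w in words})
prior = generate_prior(training, classes)
likelihood = generate_likelihood(training, vocab, classes)
print(predict(["pills", "doctor"], classes, prior, likelihood))  # M
```

Summing logarithms instead of multiplying probabilities avoids numeric underflow on long emails, and the smoothing keeps a single unseen word from forcing a class probability to zero.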
Training Data Snapshot
Test Data Snapshot
Output Snapshot