Deliverable 2 - Naive Bayes Classifier

The main purpose of this deliverable is to recognize whether a given search string, supplied in the form of a code snippet, is written in Java or in Python. This classification is achieved by implementing a Naive Bayes classifier in Java. Java and Python source code files were used as the training set.

For simplicity's sake, the source code from each source file is pasted into a text file. Each source code file is represented as a separate document, and documents are separated by \n\n in the text file; a single \n still separates content within the same document. In the program, Java and Python are treated as the hypotheses, and the probability of each is calculated as the ratio of its individual search result count to the total search result count. The individual counts were found by typing the keywords Java and Python separately into the Google search bar and noting the number of results for each. The two counts were added to obtain the total number of search results.
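The document splitting and the hypothesis probabilities described above can be sketched as follows. This is a minimal illustration, not the code from the deliverable: the class name, the method names, and the hit counts in main are hypothetical placeholders, and the real search result counts are not given in this write-up.

```java
import java.util.ArrayList;
import java.util.List;

class CorpusAndPriors {

    // Splits the training text into documents. A blank line ("\n\n")
    // ends a document; a single "\n" stays inside the current document.
    static List<String> splitDocuments(String corpus) {
        List<String> documents = new ArrayList<>();
        for (String part : corpus.split("\n\n+")) {
            if (!part.trim().isEmpty()) {
                documents.add(part);
            }
        }
        return documents;
    }

    // Probability of one hypothesis: its hit count divided by the
    // sum of the hit counts of both languages.
    static double prior(long hitsForLanguage, long hitsForOtherLanguage) {
        return (double) hitsForLanguage / (hitsForLanguage + hitsForOtherLanguage);
    }

    public static void main(String[] args) {
        String corpus = "class A {\n}\n\nclass B {\n}";
        System.out.println(splitDocuments(corpus).size());

        // Placeholder hit counts for illustration only, not real search results.
        long javaHits = 300L;
        long pythonHits = 100L;
        System.out.println(prior(javaHits, pythonHits));
        System.out.println(prior(pythonHits, javaHits));
    }
}
```

Because both priors share the same denominator, they always sum to one, which matches the two-hypothesis setup described in the text.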

After this step, the files containing Java and Python source code were separately chunked into trigrams. The probability of each trigram is calculated separately for Java and Python. This initial probability is the ratio of the number of documents in a given programming language that contain the trigram to the total number of documents available for that language. The initial probability calculated for each trigram was recorded. The search string is then chunked into trigrams, and these trigrams are recorded for later use. Next, the probability of an unknown trigram was calculated separately for Java and Python with the help of a random source code file in each language (represented in the document notation discussed earlier). The probability of an unknown trigram is the number of trigrams in the random document that are not present in the training set divided by the total number of trigrams in the random document.
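The trigram statistics described above might look like the sketch below. Two points are assumptions, since the text does not settle them: a trigram is taken to be three consecutive characters (it could equally be three tokens), and the unknown-trigram ratio is computed over distinct trigrams of the random document. All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TrigramStats {

    // Assumption: a trigram is three consecutive characters.
    // Returns the distinct trigrams of the given text.
    static Set<String> trigrams(String text) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            result.add(text.substring(i, i + 3));
        }
        return result;
    }

    // Initial probability of a trigram for one language:
    // documents containing the trigram / total documents of that language.
    static Map<String, Double> initialProbabilities(List<String> documents) {
        Map<String, Integer> documentCounts = new HashMap<>();
        for (String document : documents) {
            // trigrams() returns a set, so each document counts a trigram once.
            for (String trigram : trigrams(document)) {
                documentCounts.merge(trigram, 1, Integer::sum);
            }
        }
        Map<String, Double> probabilities = new HashMap<>();
        for (Map.Entry<String, Integer> entry : documentCounts.entrySet()) {
            probabilities.put(entry.getKey(), (double) entry.getValue() / documents.size());
        }
        return probabilities;
    }

    // Probability of an unknown trigram for one language: trigrams of the
    // random document that are missing from the training set, divided by
    // all trigrams of the random document (distinct trigrams, by assumption).
    static double unknownProbability(String randomDocument, Set<String> trainingTrigrams) {
        Set<String> candidates = trigrams(randomDocument);
        if (candidates.isEmpty()) {
            return 0.0; // document shorter than three characters
        }
        int unseen = 0;
        for (String trigram : candidates) {
            if (!trainingTrigrams.contains(trigram)) {
                unseen++;
            }
        }
        return (double) unseen / candidates.size();
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("abcd", "bcde");
        Map<String, Double> probabilities = initialProbabilities(documents);
        System.out.println(probabilities);
        System.out.println(unknownProbability("bcdx", probabilities.keySet()));
    }
}
```

The same three methods would be run once on the Java documents and once on the Python documents, giving each language its own probability table and its own unknown-trigram probability.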

Next, the initial probability recorded for each trigram from the training set is smoothed by a factor of 1 - (probability of an unknown trigram) for the corresponding programming language. This smoothing of the initial trigram probabilities is performed separately for Java and Python. After the smoothed probabilities are calculated, the trigrams present in the query are searched for among the recorded trigrams of the training set, again separately for Java and Python. To calculate the final probability of the query for a specific language, the smoothed probability in that language is used for each query trigram that is present in that language's training set, and the unknown probability of that language is used for each query trigram that was not found in its training set.
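A minimal sketch of the smoothing and lookup step follows. It reads "smoothed by 1 - probability of unknown trigram" as multiplying each initial probability by that factor, which is an assumption about the intended arithmetic; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class Smoothing {

    // Assumed reading of the smoothing step:
    // smoothed = initial * (1 - probability of an unknown trigram).
    static Map<String, Double> smooth(Map<String, Double> initial, double unknownProbability) {
        Map<String, Double> smoothed = new HashMap<>();
        for (Map.Entry<String, Double> entry : initial.entrySet()) {
            smoothed.put(entry.getKey(), entry.getValue() * (1.0 - unknownProbability));
        }
        return smoothed;
    }

    // Probability used for one query trigram in one language: the smoothed
    // value if the trigram was seen in training, otherwise the unknown probability.
    static double probabilityOf(String trigram, Map<String, Double> smoothed,
                                double unknownProbability) {
        return smoothed.getOrDefault(trigram, unknownProbability);
    }

    public static void main(String[] args) {
        Map<String, Double> initial = new HashMap<>();
        initial.put("abc", 0.5);

        Map<String, Double> smoothed = smooth(initial, 0.2);
        System.out.println(probabilityOf("abc", smoothed, 0.2)); // seen in training
        System.out.println(probabilityOf("xyz", smoothed, 0.2)); // not seen in training
    }
}
```

Falling back to the unknown probability keeps a single unseen trigram from driving the whole product to zero, which is the practical reason for this kind of smoothing.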

The final probability is the product of the smoothed probability of each known trigram, the unknown probability for each unknown trigram, the probability of the hypothesis, and Alpha. For simplicity's sake, Alpha is taken as one. This calculation is performed separately for Java and Python. The final results are then compared to decide which programming language the query belongs to: the language with the higher probability value is taken as the language of the query.
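The final calculation can be sketched as below. The names are hypothetical, the numbers in main are made-up illustration values, and resolving a tie in favor of Java is an arbitrary choice that the text does not specify.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryScorer {

    // Score of the query for one language:
    // alpha * P(hypothesis) * product over query trigrams of
    // (smoothed probability if known, unknown probability otherwise).
    static double score(List<String> queryTrigrams, Map<String, Double> smoothed,
                        double unknownProbability, double prior, double alpha) {
        double result = alpha * prior;
        for (String trigram : queryTrigrams) {
            result *= smoothed.getOrDefault(trigram, unknownProbability);
        }
        return result;
    }

    // Picks the language with the higher score (ties go to Java, arbitrarily).
    static String classify(double javaScore, double pythonScore) {
        return javaScore >= pythonScore ? "Java" : "Python";
    }

    public static void main(String[] args) {
        // Made-up illustration values, not results from the deliverable.
        Map<String, Double> javaSmoothed = new HashMap<>();
        javaSmoothed.put("abc", 0.4);
        Map<String, Double> pythonSmoothed = new HashMap<>();

        List<String> query = Arrays.asList("abc", "xyz");
        double javaScore = score(query, javaSmoothed, 0.2, 0.5, 1.0);
        double pythonScore = score(query, pythonSmoothed, 0.1, 0.5, 1.0);
        System.out.println(classify(javaScore, pythonScore));
    }
}
```

For long queries, a product of many small probabilities can underflow to zero in a double. A common variant, not described in the text, sums the logarithms of the factors instead and compares the sums.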

[Naive Bayes Classifier Program in Java - Zip]