Parameter Learning: Naive Bayes Classifiers

[5 P] Implement an algorithm for learning a naive Bayes classifier and apply it to a spam email dataset. You are required to use MATLAB for this assignment. The spam dataset is available for download on the course homepage.


Write a function called nbayes_learn.m that takes a training dataset for a binary classification task with binary attributes, together with a prior Beta distribution for each model parameter (specified by $a_i$ and $b_i$ for the $i$th model parameter), and returns the posterior Beta distribution of each model parameter of a naive Bayes classifier (specified by $a'_i$ and $b'_i$ for the $i$th model parameter).
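For reference, the conjugate update behind this function can be written out as follows; the count symbols $n_i^{(1)}$ and $n_i^{(0)}$ are notation introduced here, not part of the assignment. If the $i$th parameter is the success probability $\theta_i$ of a Bernoulli variable with prior $\mathrm{Beta}(a_i, b_i)$, and the relevant training samples $D$ contain $n_i^{(1)}$ ones and $n_i^{(0)}$ zeros, then

$$ p(\theta_i \mid D) \propto \theta_i^{n_i^{(1)}} (1-\theta_i)^{n_i^{(0)}} \cdot \theta_i^{a_i - 1} (1-\theta_i)^{b_i - 1}, \qquad a'_i = a_i + n_i^{(1)}, \quad b'_i = b_i + n_i^{(0)}. $$

For a class-conditional parameter $P(x_j = 1 \mid y = c)$, the counts run only over the training samples with label $c$; for the class prior, they run over the labels of all training samples.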


Write a function called nbayes_predict.m that takes a set of test data vectors and the posterior parameter distributions returned by nbayes_learn.m, and returns the most likely class label for each input vector.
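One common way to turn the posterior parameters into a prediction is sketched here, under the assumption that each parameter is replaced by its posterior mean. The assignment leaves this choice open, and the symbols $\hat\pi_c$ and $\hat\theta_{jc}$ are introduced only for this sketch:

$$ \hat\theta_i = \frac{a'_i}{a'_i + b'_i}, \qquad \hat y(x) = \arg\max_{c \in \{0,1\}} \Big[ \log \hat\pi_c + \sum_{j} \big( x_j \log \hat\theta_{jc} + (1 - x_j) \log (1 - \hat\theta_{jc}) \big) \Big] $$

Here $\hat\pi_c$ is the estimated probability of class $c$ and $\hat\theta_{jc}$ is the estimated probability that attribute $j$ equals 1 in class $c$. Summing logarithms instead of multiplying probabilities avoids numerical underflow when there are many attributes.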


Use both functions to conduct the following experiment. You will be working with a dataset that was created a few years ago at the Hewlett-Packard Research Labs as a testbed for comparing spam email classification algorithms.

Train a naive Bayes model on the first 2500 samples, using uniform (Laplace) prior distributions, that is, $a_i = b_i = 1$ for every parameter. Report the classification error of the trained model on a test dataset consisting of the remaining samples (2501 through 4601).

Repeat the previous step, now training on the first {10, 50, 100, 200, ... , 500} samples, and again testing on the same test data as before (samples 2501 through 4601). Report the classification error on the test dataset as a function of the number of training examples. Hand in a plot of this function.

Comment on how accurate the classifier would be if it guessed a class label at random, or if it always picked the most common label in the training data. Compare these baseline values to the results obtained for the naive Bayes model.
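The two baselines can be computed in a few lines. The following is only a minimal sketch on made-up toy labels: the variable names y_train and y_test and the encoding 1 = spam, 0 = non-spam are assumptions for illustration, not part of the assignment or the real dataset.

```matlab
% Toy label vectors for illustration only (assumed encoding: 1 = spam, 0 = non-spam).
y_train = [0 0 1 0 1 0 0 1 0 0]';
y_test  = [0 1 0 0 1 1 0 0]';

% Majority-class baseline: always predict the most common training label.
majority_label = mode(y_train);
err_majority = mean(y_test ~= majority_label);

% Guessing each of the two labels with probability 1/2 is wrong half of the
% time in expectation, regardless of the class balance.
err_random_expected = 0.5;

fprintf('Majority baseline error:     %.3f\n', err_majority);
fprintf('Expected random-guess error: %.3f\n', err_random_expected);
```

With the real data, y_train and y_test would be replaced by the label column of the training and test splits, so that the baseline errors are measured on the same test samples as the naive Bayes model.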

Present your results in a clear, structured, and legible way. Document them so that anybody can reproduce them effortlessly. Send the code of your solution to

Hubner Florian 2014-01-21