Email Spam Classification by Support Vector Machine

Traditionally, spam filtering relied on techniques such as blacklists and whitelists, but given the current state of the Internet these methods are becoming obsolete. With the increasing popularity of the Internet, it is difficult to build a spam filter that automatically separates spam from legitimate mail before it reaches the inbox and crowds it. Many computer scientists have been working on machine learning algorithms, based on statistical learning methods, to tackle this problem. The major concern at present is to build a spam filter that captures spam messages in all the variety they come in while still running efficiently. Within machine learning, the Support Vector Machine (SVM) can play a major role in spam detection and filtering; however, an SVM's performance depends heavily on the choice of kernel. In this paper, we evaluate the performance of SVM-based classifiers with two different kernel functions, a linear kernel and a Gaussian kernel, on the SpamAssassin Public Corpus dataset. Furthermore, we compare the training and testing accuracy of these two kernels on this dataset and attempt to explain which kernel behaves better on which kind of data. Finally, we take some emails extracted from the inbox and spam folder of a Gmail account and test our classifier on them.
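To make the two kernels concrete, the following is a minimal sketch of how each one scores the similarity of two emails. It is illustrative only and not the implementation evaluated in this paper: the function names, the bandwidth parameter `sigma`, and the binary word-presence vectors are assumptions introduced for the example.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the plain inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Hypothetical word-presence vectors for two emails
# (1 = vocabulary word occurs in the email, 0 = it does not).
email_a = [1, 0, 1, 1, 0]
email_b = [1, 1, 0, 1, 0]

print(linear_kernel(email_a, email_b))               # number of shared words
print(gaussian_kernel(email_a, email_b, sigma=1.0))  # similarity in (0, 1]
```

With binary features the linear kernel simply counts the vocabulary words the two emails share, while the Gaussian kernel decays with the number of words on which they differ; a smaller `sigma` makes that decay sharper.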