Text and image based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm

The Internet has changed how people communicate, and much of that communication now runs through email. Emails, text messages and online chat have become part and parcel of our lives. Of these channels, email is more prone to exploitation than the others, so email providers employ algorithms that filter messages into spam and ham. The prime aim of this paper is to detect both text-based and image-based spam emails. To that end we apply three algorithms: KNN, Naïve Bayes and reverse DBSCAN. The email text is pre-processed before the algorithms are run so that they predict better. The experiments use spam and ham emails from the Enron corpus. We compare the performance of the three algorithms on the same data set using four measures: precision, sensitivity, specificity and accuracy. All three algorithms attain good accuracy.
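The four measures named above have standard definitions in terms of confusion-matrix counts. The sketch below is illustrative only and is not taken from the paper: it assumes spam is the positive class, and the names `Confusion`, `count_confusion` and `evaluate` are hypothetical helpers introduced here.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # spam correctly predicted as spam
    fp: int  # ham wrongly predicted as spam
    tn: int  # ham correctly predicted as ham
    fn: int  # spam wrongly predicted as ham


def count_confusion(y_true, y_pred, positive="spam"):
    """Count the four outcomes, treating `positive` as the positive class (assumption)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return Confusion(tp, fp, tn, fn)


def evaluate(c):
    """Return precision, sensitivity, specificity and accuracy as a dict."""
    def ratio(num, den):
        # Guard against empty classes; returning 0.0 is a convention chosen here.
        return num / den if den else 0.0

    total = c.tp + c.fp + c.tn + c.fn
    return {
        "precision": ratio(c.tp, c.tp + c.fp),      # predicted spam that is spam
        "sensitivity": ratio(c.tp, c.tp + c.fn),    # spam that was caught (recall)
        "specificity": ratio(c.tn, c.tn + c.fp),    # ham that was let through
        "accuracy": ratio(c.tp + c.tn, total),      # all correct predictions
    }


if __name__ == "__main__":
    # Toy labels, not data from the Enron corpus.
    y_true = ["spam", "spam", "ham", "ham", "spam"]
    y_pred = ["spam", "ham", "ham", "spam", "spam"]
    print(evaluate(count_confusion(y_true, y_pred)))
```

Reporting specificity alongside sensitivity matters for spam filtering because the two error types have different costs: a missed spam message is an annoyance, while a legitimate message sent to the spam folder may be lost.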
