Development of anti-spam technique using modified K-Means & Naive Bayes algorithm

In recent years, the growing volume of spam mail on the web has become a major problem, and spam has become increasingly difficult to detect. Spam mails are junk mails or unsolicited bulk emails sent to numerous recipients. They may carry malicious content, including malware in executable file attachments. Commercially, many companies hire spammers to publicize their offers, because spam is the fastest and cheapest way of advertising. Spammers are people who apply different techniques to bypass spam filtering methods. Spam filtering techniques fall into two general classes. Rule-based, or Non Machine Learning (NML), classification uses a set of rules to decide whether an incoming message is spam. Content-based classification uses machine learning techniques and has given quite promising results. Machine Learning (ML) is concerned with the development of algorithms that allow a computer to make intelligent decisions on the basis of a dataset. Some of the commonly used statistical filters are Naive Bayes, K-Means, Support Vector Machine and TF-IDF. This paper proposes a new approach to detecting spam mails: a linear approach that combines the Modified K-Means and Naive Bayes classification algorithms. The Modified K-Means algorithm used in our approach was proposed by Malay K. Pakhira in 2009 to avoid empty clusters [19]. Relative to the modified K-Means algorithm, the proposed approach offers advantages such as improved classification accuracy and fewer iteration steps.
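The two building blocks named above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify how the two stages are chained, so the functions are shown independently. Two assumptions are made here: the modified K-Means update is taken to average each cluster's members together with the previous center (so a cluster that attracts no points keeps a well-defined center), and the Naive Bayes stage is taken to be a multinomial model with add-one smoothing. All function names and the toy messages are hypothetical.

```python
import math
from collections import Counter


def modified_kmeans(points, centers, iterations=10):
    """K-means whose update averages the members plus the old center.

    Assumption: treating the previous center as one extra member is what
    keeps a cluster with no assigned points from ending up undefined.
    """
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points
        ]
        # Update step: the old center counts as one extra member.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            sums = list(old)
            for p in members:
                sums = [s + x for s, x in zip(sums, p)]
            new_centers.append([s / (len(members) + 1) for s in sums])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def train_naive_bayes(messages, labels):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab


def classify(text, model):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
train_messages = [
    "win cash prize now",
    "cheap offer win money",
    "meeting agenda for monday",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_messages, train_labels)
print(classify("win cheap prize", model))
```

In this sketch the clustering operates on numeric feature vectors and the classifier on raw text; a real pipeline would need a shared feature representation, which the abstract does not describe.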

[1] Tomas Sochor, et al. Overview of e-mail SPAM elimination and its efficiency, 2014, 2014 IEEE Eighth International Conference on Research Challenges in Information Science (RCIS).

[2] Long Nguyen, et al. DMEA-II and its application on spam email detection problems, 2014, 2014 Seventh IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA).

[3] Malay K. Pakhira, et al. A Modified k-means Algorithm to Avoid Empty Clusters, 2009.

[4] Kang Li, et al. ALPACAS: A Large-Scale Privacy-Aware Collaborative Anti-Spam System, 2008, IEEE INFOCOM 2008 - The 27th Conference on Computer Communications.

[5] Jurandy Almeida, et al. Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters, 2009, 2009 International Conference on Machine Learning and Applications.

[6] D. P. Rana, et al. Detecting E-mail Spam Using Spam Word Associations, 2012.

[7] Ahamed B M Shafeeq, et al. Dynamic Clustering of Data with Modified K-Means Algorithm, 2012.

[8] Gurmeet Maan, et al. Enhanced discussion on different techniques of spam detection, 2013.

[9] Vidyasagar Potdar, et al. Toward spam 2.0: An evaluation of Web 2.0 anti-spam methods, 2009, 2009 7th IEEE International Conference on Industrial Informatics.

[10] Othman Ibrahim, et al. K-Means Clustering Scheme for Enhanced Spam Detection, 2014.