An empirical study on email classification using supervised machine learning in real environments

Spam email is considered one of the biggest challenges for the Internet. Email classification, which aims to correctly separate legitimate email from spam, has therefore become an important topic for both industry and academia. To this end, machine learning techniques, especially supervised machine learning (SML) algorithms, have been applied extensively. Several studies in the literature report that SML suffers from limitations such as performance fluctuation, and many works have consequently focused on designing more complex algorithms. However, we observe that most existing research is based on datasets, whereas more research is needed on the performance of SML in real environments. In this paper, we therefore perform an empirical study of this issue with three different environments and over 1,000 users. We find that SML classifiers such as decision trees and SVMs are acceptable to users in real email classification. In addition, we discuss promising directions and provide new insights in this area.
