An intelligent three-phase spam filtering method based on decision tree data mining

In this paper, we propose an efficient spam filtering method based on the decision tree data mining technique. We analyze the association rules of spam and apply these rules to develop a systematic spam filtering method. Our method has three major advantages: (i) It checks only an e-mail's header section, which avoids the low operating efficiency of scanning an e-mail's content while also improving filtering accuracy. (ii) So that a probable misjudgment in identifying an unknown e-mail can be "reversed", we construct a reversing mechanism to help classify unknown e-mails, which increases the overall accuracy of the filtering method. (iii) Our method is equipped with a re-learning mechanism, which uses supervised machine learning to collect and analyze each misjudged e-mail. The revision information learned from this analysis is fed back into the method incrementally, improving its ability to identify spam. Copyright © 2016 John Wiley & Sons, Ltd.
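
As a rough illustration of the header-only classification and re-learning ideas summarized above, the following minimal sketch trains a small ID3-style decision tree on boolean header features and rebuilds it after a misjudged e-mail is added to the training set. It is not the paper's implementation: the feature names (`from_matches_return_path`, `has_message_id`, `many_recipients`), the toy data, and the rebuild-from-scratch re-learning step are all assumptions made for illustration. The reversing mechanism is omitted because the abstract does not specify how it works.

```python
import math
from collections import Counter

# Hypothetical header-derived features; the paper's actual attributes are not given here.
FEATURES = ["from_matches_return_path", "has_message_id", "many_recipients"]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, features):
    """Grow an ID3-style tree: split on the feature with the highest information gain."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf node

    def gain(feature):
        remainder = 0.0
        for value in set(row[feature] for row in rows):
            subset = [l for row, l in zip(rows, labels) if row[feature] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(features, key=gain)
    rest = [f for f in features if f != best]
    node = {"feature": best, "default": majority, "children": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], rest
        )
    return node

def classify(tree, row):
    """Walk the tree; fall back to the node's majority label for unseen values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

def header(from_ok, has_id, many):
    """Build one feature row from three boolean header checks."""
    return dict(zip(FEATURES, (from_ok, has_id, many)))

# Toy training set (fabricated for illustration only).
train_rows = [
    header(True, True, False),
    header(True, True, True),
    header(False, False, True),
    header(False, True, True),
    header(True, False, False),
]
train_labels = ["ham", "ham", "spam", "spam", "ham"]

tree = build_tree(train_rows, train_labels, FEATURES)

# An unknown e-mail that the first tree misjudges as ham.
unknown = header(True, False, True)
before = classify(tree, unknown)

# Re-learning (assumed form): add the corrected example and rebuild the tree.
train_rows.append(unknown)
train_labels.append("spam")
tree = build_tree(train_rows, train_labels, FEATURES)
after = classify(tree, unknown)

print(before, "->", after)
```

Rebuilding the whole tree after each correction keeps the sketch short; an incremental update, as the abstract suggests, would avoid retraining from scratch.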
