Using GMDH-based networks for improved spam detection and email feature analysis

Unsolicited email, or spam, has become a major threat to the usability of electronic mail. It wastes the time and money of business users and network administrators, consumes network bandwidth and storage space, and slows down email servers. It also provides a medium for distributing harmful code and offensive content. In this paper, we explore the application of the GMDH (Group Method of Data Handling) based inductive learning approach to spam detection; the approach automatically identifies content features that effectively distinguish spam from legitimate email. We study performance at various levels of network model complexity using spambase, a publicly available benchmark dataset. Results show that a classification accuracy of 91.7% can be achieved using only 10 of the 57 available attributes, which abductive learning selects as the most effective feature subset (i.e., an 82.5% reduction in data). We also show how to improve classification performance using abductive network ensembles (committees) whose members are trained on different subsets of the training data. Comparison with other techniques, such as neural networks and naive Bayesian classifiers, shows that the GMDH-based learning approach can provide better spam detection accuracy, with false-positive rates as low as 4.3%, while requiring shorter training time.
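The quantities quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the label convention (1 = spam, 0 = legitimate), and in particular the use of simple score averaging to combine committee members are assumptions made only for illustration, since the abstract does not state how member outputs are combined.

```python
from typing import List, Sequence


def data_reduction(selected: int, available: int) -> float:
    """Fraction of attributes discarded by feature selection."""
    return 1.0 - selected / available


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of legitimate emails (label 0) wrongly flagged as spam (label 1)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)


def committee_vote(member_scores: Sequence[Sequence[float]],
                   threshold: float = 0.5) -> List[int]:
    """Combine committee members by averaging their spam scores per email.

    Averaging is an assumed combination rule, not one taken from the paper.
    """
    n_members = len(member_scores)
    n_emails = len(member_scores[0])
    means = [sum(m[i] for m in member_scores) / n_members
             for i in range(n_emails)]
    return [1 if s >= threshold else 0 for s in means]


# Keeping 10 of the 57 spambase attributes discards 47 of them.
print(f"{data_reduction(10, 57):.1%}")  # prints 82.5%
```

Under these assumptions, `committee_vote` would take one list of scores per member, each member having been trained on a different subset of the training data, and `false_positive_rate` would then be evaluated on the committee's thresholded output.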
