Spam email filtering with Bayesian belief network: using relevant words

In this paper, we report our work on a Bayesian Belief Network approach to spam email filtering (classifying email as spam or non-spam/legitimate). Our evaluation suggests that a classifier based on a Bayesian Belief Network outperforms the popular Naive Bayes approach and two other well-known learners: decision trees and k-NN. The four algorithms are tested on two different data sets with three different feature selection methods (Information Gain, Gain Ratio and Chi Squared) for finding relevant words. 10-fold cross-validation results show that the Bayesian Belief Network performs best on both data sets. We suggest that this is because the 'dependent learner' characteristics of Bayesian Belief Network classification, which model dependencies between words rather than assuming they are independent, are better suited to spam filtering. The performance of the Bayesian Belief Network classifier could be further improved by careful feature subset selection.
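To make the idea of "finding relevant words" concrete, the following is a minimal sketch of ranking words by Information Gain, one of the three feature selection methods named above. It is an illustration only, not the implementation used in the paper. It assumes each email is represented as a set of words (binary presence/absence features), and the toy corpus, the labels, and the function names `entropy`, `information_gain` and `top_words` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Reduction in class entropy from knowing whether `word` occurs.

    Assumption: each document is a set of words, so the feature is binary.
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(labels) - conditional


def top_words(docs, labels, k):
    """Return the k words with the highest information gain.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda w: (-information_gain(docs, labels, w), w))[:k]


# Hypothetical toy corpus: two spam and two legitimate ("ham") emails.
docs = [
    {"free", "money", "now"},
    {"free", "offer"},
    {"meeting", "agenda"},
    {"project", "meeting", "now"},
]
labels = ["spam", "spam", "ham", "ham"]

print(top_words(docs, labels, 2))
```

In this toy corpus, "free" and "meeting" separate the two classes perfectly, while "now" appears equally in both and carries no information. The selected words would then serve as the attributes over which a classifier, such as a Bayesian Belief Network, is learned.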
