Better Naive Bayes classification for high‐precision spam detection

Email spam has become a major problem for Internet users and providers. One major obstacle to its eradication is that any potential solution must ensure a very low false‐positive rate (FPR), which is difficult to achieve in practice. We address the problem of low‐FPR classification in the context of naive Bayes, one of the most popular machine learning models applied in the spam filtering domain. Drawing on recent extensions of the model, we propose a new term weight aggregation function, which leads to markedly better results than the standard alternatives. We identify short instances as having disproportionately poor performance and counter this behavior with collaborative filtering‐based feature augmentation. Finally, we propose a tree‐based classifier cascade in which the decision thresholds of the leaf nodes are jointly optimized for the best overall performance. These improvements, both individually and in aggregate, lead to substantially better detection rates at high precision when compared with some of the best variants of naive Bayes proposed to date. Copyright © 2009 John Wiley & Sons, Ltd.
