Grindstone4Spam: An optimization toolkit for boosting e-mail classification

As a result of the huge expansion of Internet usage, the problem of unsolicited commercial e-mail (UCE) has grown dramatically. Although a good number of successful content-based anti-spam filters are available, their adoption in real-world deployments remains limited. In this context, the SpamAssassin filter offers a rule-based framework that can easily serve as a powerful integration and deployment tool for the rapid development of new anti-spam strategies. This paper presents Grindstone4Spam, a publicly available optimization toolkit for boosting SpamAssassin performance. Its applicability has been verified in two different case studies by comparing its results with those obtained by the default SpamAssassin software and by four well-known anti-spam filtering techniques: Naive Bayes, Flexible Bayes, AdaBoost and Support Vector Machines. The proposed alternative clearly outperforms existing approaches in a cost-sensitive scenario.
