Spam Filtering Using Statistical Data Compression Models

Spam filtering poses a special problem in text categorization: its defining characteristic is that filters face an active adversary who constantly attempts to evade them. Since spam evolves continuously and most practical applications rely on online user feedback, the task calls for fast, incremental, and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers that operate on character-level or binary sequences. Because messages are modeled as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, which makes the method very robust. The models are also fast to construct and can be updated incrementally. We evaluate the filtering performance of two compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
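
To illustrate how a compression model can act as a probabilistic text classifier, the following is a minimal sketch, not the method evaluated in the paper. It assumes a fixed-order character-level Markov model with add-one smoothing in place of dynamic Markov compression or prediction by partial matching, and it assumes that a message is assigned to the class whose model encodes it in the fewest bits. The names `CharMarkovModel` and `CompressionFilter`, the order of 3, and the alphabet size of 256 are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Fixed-order character model with add-one smoothing.

    A simplified stand-in for an adaptive compression model;
    it is NOT an implementation of PPM or DMC.
    """

    def __init__(self, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed symbol alphabet size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _pairs(self, text):
        # Pad the start so the first characters also have a full context.
        padded = "\0" * self.order + text
        for i in range(self.order, len(padded)):
            yield padded[i - self.order:i], padded[i]

    def update(self, text):
        """Incrementally add one message to the model's statistics."""
        for context, symbol in self._pairs(text):
            self.counts[context][symbol] += 1
            self.totals[context] += 1

    def code_length(self, text):
        """Ideal code length of `text` in bits under this model."""
        bits = 0.0
        for context, symbol in self._pairs(text):
            seen = self.counts.get(context)
            count = seen.get(symbol, 0) if seen else 0
            total = self.totals.get(context, 0)
            prob = (count + 1) / (total + self.alphabet_size)
            bits -= math.log2(prob)
        return bits


class CompressionFilter:
    """Assigns a message to the class whose model compresses it best."""

    def __init__(self, order=3):
        self.models = {
            "spam": CharMarkovModel(order),
            "ham": CharMarkovModel(order),
        }

    def train(self, text, label):
        self.models[label].update(text)

    def classify(self, text):
        return min(self.models,
                   key=lambda label: self.models[label].code_length(text))


if __name__ == "__main__":
    f = CompressionFilter()
    f.train("buy cheap pills now", "spam")
    f.train("meeting at noon tomorrow", "ham")
    print(f.classify("cheap pills"))
    print(f.classify("meeting tomorrow at noon"))
```

Because the model works directly on raw characters, no tokenizer is needed, and each call to `train` updates the counts in place, which reflects the incremental-update property described above. The toy training messages are invented for the example only.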
