Filtering Spam Using Kolmogorov Complexity Estimates

This paper introduces an adaptive filter that classifies spam email using Kolmogorov complexity estimates. The complexity filter is first trained in the same way as a Bayesian filter. Each email is then mapped to a binary string in which every token, or word, is represented by a single bit: tokens associated with spam become 1, tokens associated with non-spam, or ham, become 0, and common tokens are ignored. The Kolmogorov complexity of this string is estimated using run-length compression. If the estimated complexity is low, the email is classified as spam; otherwise it is classified as ham. The complexity filter can filter messages almost twice as fast as a comparable Bayesian filter and achieves accuracy rates of 80% to 96%. Whereas a Bayesian filter views an email as a "bag of words", the complexity filter also uses information about how tokens are distributed within the message, and is therefore likely to be less vulnerable to statistical attack.
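The pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the token sets, the function names, the choice of "number of runs divided by string length" as the run-length complexity estimate, the threshold of 0.5, and the treatment of an empty string as ham are all assumptions made for illustration.

```python
from itertools import groupby


def to_bit_string(tokens, spam_tokens, ham_tokens):
    """Map each token to '1' (spam-associated) or '0' (ham-associated).

    Tokens found in neither set are treated as common and skipped.
    """
    bits = []
    for tok in tokens:
        if tok in spam_tokens:
            bits.append("1")
        elif tok in ham_tokens:
            bits.append("0")
    return "".join(bits)


def run_length_encode(bits):
    """Return a list of (symbol, run_length) pairs for the bit string."""
    return [(sym, len(list(group))) for sym, group in groupby(bits)]


def complexity_estimate(bits):
    """Assumed estimate: runs per symbol, in (0, 1] for non-empty input.

    Long runs compress well and give a value near 0; a string that
    alternates on every symbol gives 1.0.
    """
    if not bits:
        return 0.0
    return len(run_length_encode(bits)) / len(bits)


def classify(tokens, spam_tokens, ham_tokens, threshold=0.5):
    """Label a message 'spam' if its estimated complexity is below threshold.

    The threshold is a hypothetical parameter. An email with no
    informative tokens is labelled 'ham' here, which is an assumption.
    """
    bits = to_bit_string(tokens, spam_tokens, ham_tokens)
    if not bits:
        return "ham"
    return "spam" if complexity_estimate(bits) < threshold else "ham"


if __name__ == "__main__":
    spam_tokens = {"free", "winner", "viagra"}   # hypothetical training result
    ham_tokens = {"meeting", "report", "agenda"}  # hypothetical training result
    message = ["free", "free", "winner", "the", "viagra", "meeting"]
    print(to_bit_string(message, spam_tokens, ham_tokens))
    print(classify(message, spam_tokens, ham_tokens))
```

Unlike a bag-of-words score, the estimate depends on token order: the strings 1100 and 1010 contain the same tokens but have two and four runs respectively. Applied literally, the low-complexity rule would also flag a string made up only of 0s; the abstract does not say how that case is handled, so the sketch leaves it as is.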
