Spam Filtering without Text Analysis

This paper introduces a new way to filter spam that uses Kolmogorov complexity theory as its theoretical background and a Support Vector Machine as its learning component. The idea is to skip the classical text analysis used by standard filtering techniques and instead to measure the information content of a message in order to classify it as spam or legitimate. Because the information content of a message can be estimated through compression techniques, we represent each e-mail as a multi-dimensional real vector and train a Support Vector Machine on these vectors. The resulting classifier achieves accuracy rates in the range of 90%-97%, which places our combined technique among the top current spam filtering technologies.
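As a rough illustration of how a message can become a real vector without any text analysis, the sketch below estimates information content with three general-purpose compressors from the Python standard library. The function name `compression_features`, the choice of zlib, bz2 and lzma, and the use of compression ratios as features are all assumptions made for this example; the abstract does not specify the paper's actual features. The training step is omitted: vectors like these would then be passed, together with spam/legitimate labels, to a Support Vector Machine.

```python
import bz2
import lzma
import random
import string
import zlib


def compression_features(message: str) -> list[float]:
    """Map a message to a vector of compression ratios.

    Each ratio (compressed size / original size) is a crude, computable
    stand-in for the message's Kolmogorov complexity per byte. The three
    compressors are an illustrative choice, not the paper's feature set.
    """
    raw = message.encode("utf-8")
    if not raw:
        # An empty message has no content to measure.
        return [0.0, 0.0, 0.0]
    n = len(raw)
    return [
        len(zlib.compress(raw, 9)) / n,
        len(bz2.compress(raw, 9)) / n,
        len(lzma.compress(raw)) / n,
    ]


# Two hypothetical messages of equal length: one highly repetitive,
# one made of seeded pseudo-random printable characters.
repetitive = "BUY NOW! " * 200
rng = random.Random(0)
alphabet = string.ascii_letters + string.digits + string.punctuation
varied = "".join(rng.choice(alphabet) for _ in range(len(repetitive)))

if __name__ == "__main__":
    print("repetitive:", compression_features(repetitive))
    print("varied:    ", compression_features(varied))
```

The repetitive message compresses to a small fraction of its size while the pseudo-random one barely compresses, so the two land far apart in feature space. For very short messages the ratios can exceed 1 because of compressor header overhead, which a real system would need to account for.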
