Paper information - Spam Filtering without Text Analysis

Our paper introduces a new way to filter spam, with Kolmogorov complexity theory as its theoretical background and a Support Vector Machine as its learning component. Our idea is to skip the classical text analysis used by standard filtering techniques and to focus instead on measuring the informative content of a message in order to classify it as spam or legitimate. Because the information content of a message can be estimated through compression techniques, we represent an e-mail as a multi-dimensional real vector and train a Support Vector Machine on these vectors. The resulting classifier achieves accuracy rates in the range of 90%-97%, placing our combined technique at the top of current spam filtering technologies.
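The idea of estimating information content through compression can be illustrated with a minimal Python sketch. It is not the authors' feature construction, which the abstract does not specify: the function name `complexity_vector`, the choice of three standard-library compressors, and the use of plain compression ratios as vector components are all assumptions made for illustration.

```python
import bz2
import lzma
import random
import string
import zlib
from typing import List

# Hypothetical choice of compressors; the abstract does not say which were used.
COMPRESSORS = (zlib.compress, bz2.compress, lzma.compress)


def complexity_vector(message: str) -> List[float]:
    """Map a message to a real vector of compression ratios.

    Each component is compressed size divided by original size, a rough
    upper-bound estimate of the message's information content per byte.
    """
    data = message.encode("utf-8")
    if not data:
        # An empty message carries no information; avoid dividing by zero.
        return [0.0] * len(COMPRESSORS)
    return [len(compress(data)) / len(data) for compress in COMPRESSORS]


# A highly repetitive message versus a pseudo-random one of the same length.
repetitive = "buy now " * 200
rng = random.Random(0)  # fixed seed so the example is reproducible
varied = "".join(rng.choices(string.ascii_letters, k=len(repetitive)))

print(complexity_vector(repetitive))
print(complexity_vector(varied))
```

The repetitive message compresses far better than the pseudo-random one, so its ratios are smaller. Vectors of this kind, labeled as spam or legitimate, could then be passed to any Support Vector Machine implementation (for example scikit-learn's `sklearn.svm.SVC`, assuming that library is available) to train a classifier.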

Gilles Richard | Sihem Belabbes
