Effectiveness and Limitations of Statistical Spam Filters

In this paper, we discuss the techniques underlying the design of well-known statistical spam filters, including Naive Bayes, Term Frequency-Inverse Document Frequency (TF-IDF), K-Nearest Neighbor, Support Vector Machine, and Bayesian Additive Regression Trees. We compare these techniques in terms of accuracy, recall, precision, and related measures. Further, we discuss the effectiveness and limitations of statistical filters in separating various types of spam from legitimate e-mail.
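The comparison measures named above are not defined in this abstract. As a minimal sketch, assuming the standard definitions with spam as the positive class, they could be computed from a filter's predictions as follows. The function name `spam_metrics`, the `SpamMetrics` container, and the label encoding (1 = spam, 0 = legitimate) are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class SpamMetrics:
    accuracy: float   # fraction of all messages classified correctly
    precision: float  # fraction of messages flagged as spam that are spam
    recall: float     # fraction of actual spam that was flagged


def spam_metrics(y_true, y_pred, spam_label=1):
    """Compute accuracy, precision, and recall with spam as the positive class.

    y_true and y_pred are equal-length sequences of labels. The encoding
    (1 = spam, 0 = legitimate) is an assumption made for this example.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == spam_label:
            if truth == spam_label:
                tp += 1  # spam correctly flagged
            else:
                fp += 1  # legitimate mail wrongly flagged
        else:
            if truth == spam_label:
                fn += 1  # spam that slipped through
            else:
                tn += 1  # legitimate mail correctly passed

    total = tp + fp + tn + fn
    # Guard against empty denominators by reporting 0.0.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return SpamMetrics(accuracy, precision, recall)


if __name__ == "__main__":
    # Made-up labels for illustration only; not results from the paper.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    print(spam_metrics(y_true, y_pred))
```

In spam filtering, a false positive (legitimate mail flagged as spam) is usually costlier than a false negative, which is one reason precision and recall are typically reported alongside accuracy rather than accuracy alone.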
