Using online linear classifiers to filter spam emails

The performance of two online linear classifiers, the Perceptron and Littlestone's Winnow, is explored on two benchmark anti-spam filtering corpora, PU1 and Ling-Spam. We study performance for varying numbers of features and for three feature selection methods: information gain (IG), document frequency (DF), and odds ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better with IG or DF than with odds ratio. It is further demonstrated that, when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and only mildly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron, and both online classifiers perform much better than a standard Naïve Bayes method. The computational complexity of the two classifiers, both in theory and in implementation, is very low, and their models can be updated adaptively with little effort. They outperform most published results while being significantly easier to train and adapt. The analysis and the promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
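To make the two update rules concrete, the sketch below shows one online training step for each classifier over binary (0/1) bag-of-words feature vectors, as they would be applied after feature selection. The function names, learning rate, promotion factor, and threshold are illustrative assumptions and are not taken from the paper's experimental setup.

```python
# Minimal sketch of the Perceptron and Winnow online update rules.
# Assumes x is a binary feature vector and y is +1 (spam) or -1 (legitimate).

def perceptron_update(w, b, x, y, lr=1.0):
    """One online Perceptron step: additive update only when a mistake is made."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    pred = 1 if score >= 0 else -1
    if pred != y:
        for i, xi in enumerate(x):
            w[i] += lr * y * xi
        b += lr * y
    return w, b

def winnow_update(w, x, y, alpha=2.0, theta=None):
    """One online Winnow step: multiplicative promotion/demotion on a mistake."""
    if theta is None:
        theta = len(x) / 2.0            # an illustrative default threshold
    score = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1 if score >= theta else -1
    if pred != y:
        factor = alpha if y == 1 else 1.0 / alpha
        for i, xi in enumerate(x):
            if xi:                      # only weights of active features change
                w[i] *= factor
    return w

# Toy usage: two features, one spam (+1) and one legitimate (-1) message.
w_p, b = [0.0, 0.0], 0.0
w_w = [1.0, 1.0]
for x, y in [([1, 0], 1), ([0, 1], -1)]:
    w_p, b = perceptron_update(w_p, b, x, y)
    w_w = winnow_update(w_w, x, y)
```

Both procedures touch only the weights of features present in the current message, which is why adapting the filters to new mail is cheap in practice.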
