Advances in online learning-based spam filtering

The low cost of digital communication has given rise to the problem of email spam: unwanted, harmful, or abusive electronic content. In this thesis, we present several advances in the application of online machine learning methods to automatic spam filtering. We detail a sliding-window variant of Support Vector Machines that yields state-of-the-art results on the standard online filtering task. We explore a variety of feature representations for spam data. We reduce human labeling cost through efficient online active learning variants. We give practical solutions to the one-sided feedback scenario, in which users give labeling feedback only on messages predicted to be non-spam. We investigate the impact of class label noise on machine learning-based spam filters, showing that previous benchmark evaluations rewarded filters that are prone to overfitting in real-world settings, and we propose several modifications to combat these negative effects. Finally, we investigate the performance of these filtering methods on the more challenging task of abuse filtering in blog comments. Together, these contributions enable more accurate spam filters to be deployed in real-world settings, with greater robustness to noise, lower computation cost, and lower human labeling cost.
