Chinese spam filtering based on online active learning methods

In this paper, new active learning methods are proposed to filter Chinese spam. Labeling the spam emails in large datasets is time-consuming and expensive. Active learning methods can substantially reduce the labeling cost by identifying informative examples, and they speed up the online logistic regression filter. The experiments show that our methods not only decrease the number of label requests but also improve the classification performance of spam filtering.
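The abstract does not spell out the proposed methods, so the following is only a minimal sketch of the general idea: an online logistic regression filter that requests a label only when its prediction is uncertain. The character-bigram features, the fixed uncertainty threshold, the learning rate, and all names (`OnlineLogisticRegression`, `char_bigrams`, `run_active_filter`, `oracle`) are assumptions made for illustration and are not taken from the paper.

```python
import math


class OnlineLogisticRegression:
    """Sparse logistic regression trained one example at a time (hypothetical helper)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # feature -> weight; unseen features count as 0.0

    def predict_proba(self, features):
        score = sum(self.weights.get(f, 0.0) * v for f, v in features.items())
        score = max(-30.0, min(30.0, score))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        # One stochastic gradient step on the log-loss for a single example.
        error = label - self.predict_proba(features)
        for f, v in features.items():
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * error * v


def char_bigrams(text):
    # Assumption: overlapping character bigrams, which avoid the need for
    # Chinese word segmentation. The paper's feature set may differ.
    return {text[i:i + 2]: 1.0 for i in range(len(text) - 1)}


def run_active_filter(stream, oracle, threshold=0.3, learning_rate=0.5):
    """Classify each message, then ask `oracle` for a label only when uncertain."""
    model = OnlineLogisticRegression(learning_rate)
    predictions = []
    label_requests = 0
    for message in stream:
        features = char_bigrams(message)
        p_spam = model.predict_proba(features)
        predictions.append(1 if p_spam >= 0.5 else 0)
        # Uncertainty sampling with a fixed margin around p = 0.5 (an assumption).
        if abs(p_spam - 0.5) < threshold:
            label_requests += 1
            model.update(features, oracle(message))
    return model, predictions, label_requests


# Toy usage: 1 = spam, 0 = ham; the oracle stands in for a human labeler.
stream = ["免费中奖", "明天开会"] * 20
oracle = lambda message: 1 if "免费" in message else 0
model, predictions, label_requests = run_active_filter(stream, oracle)
```

On this toy stream the filter stops requesting labels once its predictions move far enough from 0.5, which is the mechanism by which such a scheme can cut labeling cost and skip training updates.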
