Adaptive spam filtering using dynamic feature space

Unsolicited bulk e-mail, also known as spam, is a growing problem for the e-mail community. This paper presents a new spam filtering strategy that (1) uses a practical entropy-coding technique, Huffman coding, to dynamically encode the feature space of e-mail collections over time, and (2) applies an online algorithm to adaptively refine the learned spam concept as new e-mail data becomes available. The contributions of this work include a highly efficient spam filtering algorithm in which the input space is radically reduced to a one-dimensional input vector, and an adaptive learning technique that is robust to vocabulary change, concept drift, and skewed data distributions. We compare our technique to several existing off-line learning techniques, including support vector machines, naive Bayes, k-nearest neighbor, the C4.5 decision tree, RBF networks, boosted decision trees, and stacking, and demonstrate its effectiveness with experimental results on publicly available e-mail data.
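The paper's scheme is dynamic (codes are updated as new mail arrives), but the core idea of collapsing a message into a single scalar via Huffman code lengths can be sketched statically. The following is a minimal illustration, not the paper's algorithm: the token frequencies, the fixed fallback cost for unseen tokens, and the function names are assumptions made for the example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Compute Huffman code lengths (tree depths) per symbol.

    Standard greedy construction: repeatedly merge the two
    least-frequent subtrees; each merge deepens the symbols
    it contains by one bit.
    """
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items: (frequency, tiebreak counter, {symbol: depth}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def message_code_length(tokens, lengths, unseen_cost=16):
    """Reduce a token sequence to one scalar feature: its total
    encoded length in bits, with a fixed cost for unseen tokens
    (the fallback cost is an assumption of this sketch)."""
    return sum(lengths.get(t, unseen_cost) for t in tokens)

# Hypothetical token frequencies from a training corpus.
corpus_freqs = Counter({"free": 50, "money": 30, "meeting": 15, "report": 5})
lengths = huffman_code_lengths(corpus_freqs)
print(lengths)                                          # frequent tokens get short codes
print(message_code_length(["free", "money", "free"], lengths))  # → 4
```

Because frequent tokens receive short codes, messages dominated by common vocabulary compress to small values, giving the downstream learner a one-dimensional input rather than a high-dimensional bag-of-words vector.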
