Adaptive spam filtering using dynamic feature space

Unsolicited bulk e-mail, also known as spam, is a growing problem for the e-mail community. This paper presents a new spam filtering strategy that (1) uses a practical entropy-coding technique, Huffman coding, to dynamically encode the feature space of e-mail collections over time, and (2) applies an online algorithm to adaptively refine the learned spam concept as new e-mail data becomes available. The contributions of this work include a highly efficient spam filtering algorithm in which the input space is radically reduced to a one-dimensional input vector, and an adaptive learning technique that is robust to vocabulary change, concept drift, and skewed data distributions. We compare our technique to several existing off-line learning techniques, including support vector machines, naive Bayes, k-nearest neighbor, the C4.5 decision tree, RBF networks, boosted decision trees, and stacking, and demonstrate its effectiveness with experimental results on publicly available e-mail data.
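The paper's scheme is dynamic (codes are updated as new mail arrives), but the core idea of collapsing a message into a single scalar via Huffman code lengths can be sketched statically. The following is a minimal illustration, not the paper's algorithm: the token frequencies, the fixed fallback cost for unseen tokens, and the function names are assumptions made for the example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Compute Huffman code lengths (tree depths) per symbol.

    Standard greedy construction: repeatedly merge the two
    least-frequent subtrees; each merge deepens the symbols
    it contains by one bit.
    """
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items: (frequency, tiebreak counter, {symbol: depth}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def message_code_length(tokens, lengths, unseen_cost=16):
    """Reduce a token sequence to one scalar feature: its total
    encoded length in bits, with a fixed cost for unseen tokens
    (the fallback cost is an assumption of this sketch)."""
    return sum(lengths.get(t, unseen_cost) for t in tokens)

# Hypothetical token frequencies from a training corpus.
corpus_freqs = Counter({"free": 50, "money": 30, "meeting": 15, "report": 5})
lengths = huffman_code_lengths(corpus_freqs)
print(lengths)                                          # frequent tokens get short codes
print(message_code_length(["free", "money", "free"], lengths))  # → 4
```

Because frequent tokens receive short codes, messages dominated by common vocabulary compress to small values, giving the downstream learner a one-dimensional input rather than a high-dimensional bag-of-words vector.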
