Improving Anti-spam Engine with Large Imbalanced Dataset Using Information Retrieval Technology

Anti-spam systems commonly use machine learning to identify spam emails. Unfortunately, the email samples used to train these models are rarely ideal: spam emails far outnumber normal ones, which can bias the models and degrade prediction performance. In addition, the sheer number of samples makes the training process prohibitively resource-intensive and makes the samples difficult for human engineers to sort. In this paper, we propose an information-retrieval-based approach to compress and balance the training data set. The key idea is to shrink and balance the training data set by removing similar samples using information retrieval technology. Experiments show that an anti-spam classifier can achieve better performance with a much smaller, balanced training data set when this approach is applied.
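
The abstract does not specify which retrieval technique is used to detect similar emails. As a minimal sketch of the general idea, assuming TF-IDF vectors, cosine similarity, and a greedy pass that drops any email too close to one already kept, the shrinking step might look like the following. The function names (`tokenize`, `tfidf_vectors`, `shrink`) and the `threshold` value of 0.8 are illustrative assumptions, not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF, so that no term receives a zero weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def shrink(docs, threshold=0.8):
    """Greedily keep a document only if it is not too similar to a kept one.

    `threshold` is an assumed tuning knob; the paper's actual criterion
    is not given in the abstract.
    """
    vectors = tfidf_vectors(docs)
    kept = []
    for i, v in enumerate(vectors):
        if all(cosine(v, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]


if __name__ == "__main__":
    spam = [
        "Buy cheap pills now",
        "buy cheap pills now!!!",
        "Win a free cruise today",
    ]
    ham = ["Meeting moved to 3pm tomorrow"]
    print(len(spam), "->", len(shrink(spam)))
    print(len(ham), "->", len(shrink(ham)))
```

Applying `shrink` to each class separately reduces the data set, and it moves the classes toward balance only to the extent that the over-represented class (here, spam) contains more near-duplicates. The pairwise comparison is quadratic in the number of kept emails, so a real system at the scale the abstract describes would likely need an inverted index or a similar retrieval structure; that part is not sketched here.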
