Spam Detection Using Dynamic Weighted Voting Based on Clustering

Over the last decade, spam detection has been addressed as a text classification or categorization problem. In this paper, we propose a new dynamic weighted voting method that combines clustering with weighted voting, and we apply it to the task of spam filtering. To classify a new sample, the method first compares the sample with all cluster centroids to determine its similarity to each cluster; classifiers in the vicinity of the input sample then receive greater weight in the final decision of the ensemble. The evaluation shows that the algorithm outperforms a pure SVM.
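The voting step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the similarity measure, the base classifiers, or the weighting formula, so the inverse Euclidean distance weighting, the one-classifier-per-cluster layout, and the names `similarity_weights` and `weighted_vote` are all assumptions made for illustration.

```python
import math

def similarity_weights(x, centroids):
    """Map distances to each centroid into weights that sum to 1.

    Assumption: similarity is inverse Euclidean distance; the source
    abstract does not state which similarity measure is used.
    """
    raw = [1.0 / (math.dist(x, c) + 1e-9) for c in centroids]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_vote(x, centroids, classifiers):
    """Combine per-cluster classifier votes, weighted by closeness to x.

    Each classifier is a callable returning +1 (spam) or -1 (ham).
    Assumption: one classifier is associated with each cluster.
    """
    weights = similarity_weights(x, centroids)
    score = sum(w * clf(x) for w, clf in zip(weights, classifiers))
    return ("spam" if score > 0 else "ham"), weights

# Toy setup: two clusters, each with a hypothetical constant classifier.
centroids = [(0.0, 0.0), (10.0, 10.0)]
classifiers = [lambda x: 1, lambda x: -1]

label, weights = weighted_vote((1.0, 1.0), centroids, classifiers)
print(label, [round(w, 2) for w in weights])
```

In this toy setup the sample lies much closer to the first centroid, so the first classifier dominates the vote; a sample near the second centroid would be decided mainly by the second classifier. In practice the centroids would come from clustering the training set and the classifiers would be trained models, such as SVMs.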
