Imbalanced Web Spam Classification Using Self-labeled Techniques and Multi-classifier Models

Web spam has become a critical problem in the web search area. Unfortunately, a highly imbalanced class distribution and a large number of unlabeled instances degrade classifier performance. In this paper, we focus on handling the severe class imbalance of web spam within the semi-supervised learning framework. First, we introduce self-labeled techniques and the multi-classifier model. Second, we describe the imbalance of web spam data sets and propose five combination methods. In particular, we propose several improved self-labeled methods that apply the classic over-sampling technique SMOTE in the pre-processing stage to balance the uneven labeled sets. Further, given the severe imbalance of web spam, we introduce the AUC value into semi-supervised classification. Experiments on WEBSPAM-UK2007 indicate that our methods achieve better recall and AUC values.
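As a rough illustration of the two ingredients named in the abstract, the sketch below shows SMOTE-style over-sampling of a small labeled minority (spam) set and a pairwise AUC computation. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names `smote` and `auc`, the toy feature vectors, and the choice of `k` are all hypothetical, and the self-labeled and multi-classifier steps are not shown.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points, SMOTE-style: pick a minority
    sample, pick one of its k nearest minority neighbours, and interpolate
    at a random position on the segment between them.
    Assumes the minority set holds at least two points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (squared Euclidean), excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive instance receives the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labeled set (hypothetical two-feature vectors): 3 spam vs. 9 normal.
spam = [(0.9, 0.8), (0.8, 0.9), (1.0, 1.0)]
normal = [(i / 10, (i % 3) / 10) for i in range(9)]

# Balance the labeled set before any self-labeling step would run.
new_spam = smote(spam, len(normal) - len(spam))
balanced_spam = spam + new_spam

# AUC on four hypothetical classifier scores (1 = spam, 0 = normal).
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because AUC depends only on how spam instances rank relative to normal ones, it is less sensitive to the class ratio than accuracy, which is one reason it suits a heavily imbalanced setting like this one.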
