Machine Learning Methods for Spamdexing Detection

In this paper, we present recent contributions to the battle against one of the main problems faced by search engines: spamdexing, also known as web spamming. Spamdexing comprises malicious techniques applied to web pages in order to deceive search engines and gain undue visibility in search results. To better understand the problem and to find the best setup and methods for countering it, we present a comprehensive performance evaluation of several established machine learning techniques. In our experiments, we employed two large, public, real-world datasets: the WEBSPAM-UK2006 and WEBSPAM-UK2007 collections. The samples are represented by content-based, link-based, and transformed link-based features, as well as their combinations. The results indicate that bagging of decision trees, multilayer perceptron neural networks, random forest, and adaptive boosting of decision trees are promising for web spam classification.
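To make the idea of bagging concrete, the following is a minimal sketch and not the experimental setup of the paper. It assumes a synthetic two-feature dataset in place of the WEBSPAM-UK collections, and it substitutes one-level decision stumps for full decision trees to stay short. All function names (`make_toy_hosts`, `train_stump`, `bagging_fit`, `bagging_predict`) are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter


def make_toy_hosts(n, seed):
    """Generate synthetic 2-feature samples (assumption: NOT the WEBSPAM-UK data).

    Label 1 stands in for "spam", label 0 for "non-spam".
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        center = 3.0 if label == 1 else 0.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def predict_stump(stump, row):
    """Predict `polarity` when the chosen feature exceeds the threshold."""
    j, t, polarity = stump
    return polarity if row[j] > t else 1 - polarity


def train_stump(X, y):
    """Fit a one-level decision tree by exhaustive search over splits."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (0, 1):
                stump = (j, t, polarity)
                errors = sum(predict_stump(stump, r) != l for r, l in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, stump)
    return best[1]


def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Combine the stumps by majority vote."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]


X_train, y_train = make_toy_hosts(200, seed=1)
X_test, y_test = make_toy_hosts(100, seed=2)
models = bagging_fit(X_train, y_train)
accuracy = sum(
    bagging_predict(models, r) == l for r, l in zip(X_test, y_test)
) / len(X_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The same pattern (resample, fit a base learner, vote) carries over to stronger base learners; the odd number of estimators simply avoids tied votes in this two-class toy setting.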
