Detecting Web Spam in Webgraphs with Predictive Model Analysis

Web spam is a serious threat to both end users and search engines (with respect to query cost). Webgraphs can be exploited to detect spam, and several graph mining techniques have previously been applied to compute metrics for pages and hyperlinks. In this paper, we demonstrate the value of the webgraph for distinguishing spam websites from non-spam ones, using several graph metrics computed for a labelled dataset (WEBSPAM-UK2007), and we validate our model by testing it on the uk-2014 dataset, the most recently available dataset for the same (uk) domain. The WEBSPAM-UK2007 dataset includes 0.1 million distinct hosts and four kinds of feature sets: Obvious, Link, Transformed Link and Content. We use five prominent machine learning (ML) techniques (Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Logistic Regression, Naïve Bayes and Random Forest) to build an ML-based classifier. To evaluate the classifier, we compute accuracy and F1 score and perform 10-fold cross-validation. We also compare graph-based features with content-based textual features and find that graph-based features perform as well as or better than textual ones. We achieve above 99% training accuracy for most of our machine learning models. We test our model on the uk-2014 dataset, which has 4.7 million hosts, using the graph-based feature sets, and achieve accuracy between 90% and 94% for most of the models. To the best of our knowledge, prior work on web spam detection with the WEBSPAM-UK2007 dataset did not use a separate test dataset to evaluate its models. Our classifier is capable of detecting web spam for any input webgraph based on its graph metric features.
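The evaluation protocol described above (10-fold cross-validation, scored by accuracy and F1) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the data is synthetic, the two numeric features merely stand in for hypothetical per-host graph metrics, and only a simple KNN classifier (one of the five model families named above) is shown. All function names are assumptions made for this illustration.

```python
import random
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (spam) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (squared Euclidean)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k_folds=10, k_neighbors=3):
    """Return mean accuracy and mean F1 over k_folds held-out folds."""
    accs, f1s = [], []
    for held_out in k_fold_indices(len(X), k_folds):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        preds = [knn_predict(train_X, train_y, X[i], k_neighbors) for i in held_out]
        truth = [y[i] for i in held_out]
        accs.append(accuracy(truth, preds))
        f1s.append(f1_score(truth, preds))
    return sum(accs) / len(accs), sum(f1s) / len(f1s)

def make_toy_hosts(n=200, seed=42):
    """Synthetic stand-in data: label 1 = spam, 0 = non-spam (NOT real webgraph data)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 2.0 if label else 0.0  # hypothetical separation between classes
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

X, y = make_toy_hosts()
mean_acc, mean_f1 = cross_validate(X, y)
print(f"10-fold mean accuracy: {mean_acc:.3f}, mean F1: {mean_f1:.3f}")
```

Averaging the per-fold scores gives an estimate of performance on the labelled data only; testing on a separately collected dataset, as the abstract describes for uk-2014, is a distinct step that cross-validation does not replace.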
