Spam detection through link authorization from neighboring nodes

Current link spam techniques manipulate both good and bad pages to boost the desired target page(s) and attract web surfers. Today's web contains links from bad pages to good pages and vice versa, as well as links between pages of the same kind. It is widely held that good pages seldom link to bad ones, so spamming is assumed when such links occur, and the good pages involved are penalized. This penalization tends to be unfair, however, since every web page has an honest and a dishonest part. In addition, several factors, such as page similarity, influence the distribution of hyperlinks on the web. On this basis, the paper proposes a Link Authorization Model to detect the propagation of link spam onto neighboring pages. We design metrics with relevant link and content features to compute the angular similarity between connected good and bad pages. Based on this angular similarity, we classify each link between pages as a true or false authorization. For every false authorization detected, the page with the outgoing link is penalized by a predetermined threshold. Our results show an average spamicity of 0.77 and a corresponding demotion of 0.60.
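
The classification step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the feature vectors, the angular similarity formula (1 minus the angle between the vectors divided by pi), the 0.5 decision threshold, the 0.1 penalty, and all function names are assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty feature vector as unrelated
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return max(-1.0, min(1.0, dot / (norm_u * norm_v)))


def angular_similarity(u, v):
    """Assumed definition: 1 - angle / pi, giving a value in [0, 1]."""
    return 1.0 - math.acos(cosine_similarity(u, v)) / math.pi


def is_true_authorization(source_features, target_features, threshold=0.5):
    """Hypothetical rule: a link is a true authorization when the
    linking page and the linked page are similar enough."""
    return angular_similarity(source_features, target_features) >= threshold


def demote(score, penalty=0.1):
    """Apply a fixed, predetermined penalty to the linking page's score."""
    return max(0.0, score - penalty)


# Hypothetical link and content features for a good page and a bad page.
good_page = [0.9, 0.1, 0.8]
bad_page = [0.1, 0.9, 0.0]

score = 1.0
if not is_true_authorization(good_page, bad_page):
    score = demote(score)
```

Under these assumptions, a page is demoted only for links to dissimilar pages, rather than for every link to a bad page.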
