A unified score propagation model for web spam demotion algorithm

Web spam pages use various spamming techniques to exploit the biases of search engine algorithms and obtain rankings higher than they deserve. Many web spam demotion algorithms combat spam by using the web link structure to evaluate a goodness or badness score for each web page. These scores are then used to identify spam pages or to demote them in search engine results. However, most published spam demotion algorithms improve on their base models only marginally and remain vulnerable to some common score manipulation methods. The lack of a general framework for this field makes designing high-performance spam demotion algorithms very inefficient. In this paper, we propose a unified score propagation model for web spam demotion algorithms. The model abstracts the score propagation process of relevant models into a forward score propagation function and a backward score propagation function, each of which can be further expressed as three sub-functions: a splitting function, an accepting function and a combination function. On the basis of the proposed model, we develop two new web spam demotion algorithms: Supervised Forward and Backward score Ranking (SFBR) and Unsupervised Forward and Backward score Ranking (UFBR). Our experiments, conducted on three large-scale public datasets, show that (1) SFBR is very robust and clearly outperforms the other algorithms and (2) UFBR, although unsupervised, obtains results comparable to those of some well-known supervised algorithms on the spam demotion task.
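The decomposition into splitting, accepting and combination functions can be illustrated with a small sketch. The abstract does not define these sub-functions, so everything below is an assumption made for illustration: the names `propagate`, `uniform_split`, `accept_all` and `damped_sum` are hypothetical, the splitting rule is the uniform one used by PageRank-style algorithms, and the damping factor 0.85 is a conventional choice. Forward propagation is taken to mean spreading goodness from trusted seeds along links, and backward propagation to mean spreading badness from spam seeds along reversed links. This is not the SFBR or UFBR algorithm.

```python
from typing import Callable, Dict, List

Graph = Dict[str, List[str]]  # page -> pages it links to


def reverse_graph(graph: Graph) -> Graph:
    """Return the graph with every link u -> v flipped to v -> u."""
    reversed_links: Graph = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            reversed_links.setdefault(target, []).append(source)
    return reversed_links


def propagate(
    graph: Graph,
    seeds: Dict[str, float],
    split: Callable[[float, int], float],
    accept: Callable[[float], float],
    combine: Callable[[List[float], float], float],
    iterations: int = 30,
) -> Dict[str, float]:
    """Generic propagation loop built from the three sub-functions."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    seed_scores = {node: seeds.get(node, 0.0) for node in nodes}
    scores = dict(seed_scores)
    for _ in range(iterations):
        inbox: Dict[str, List[float]] = {node: [] for node in nodes}
        for source in nodes:
            targets = graph.get(source, [])
            if not targets:
                continue
            # Splitting: how much of the source's score each neighbour is offered.
            share = split(scores[source], len(targets))
            for target in targets:
                # Accepting: how much of the offered share the receiver keeps.
                inbox[target].append(accept(share))
        # Combination: merge the accepted shares with the node's seed score.
        scores = {node: combine(inbox[node], seed_scores[node]) for node in nodes}
    return scores


# Assumed instantiations of the three sub-functions (illustrative only).
def uniform_split(score: float, out_degree: int) -> float:
    return score / out_degree


def accept_all(share: float) -> float:
    return share


def damped_sum(shares: List[float], seed: float, alpha: float = 0.85) -> float:
    return alpha * sum(shares) + (1.0 - alpha) * seed


# Toy link graph with one trusted seed and one known spam seed.
links: Graph = {
    "good": ["a"],
    "a": ["b"],
    "b": [],
    "spam_farm": ["spam"],
    "spam": [],
}

# Forward: goodness flows from the trusted seed along the links.
goodness = propagate(links, {"good": 1.0}, uniform_split, accept_all, damped_sum)
# Backward: badness flows from the spam seed along the reversed links.
badness = propagate(
    reverse_graph(links), {"spam": 1.0}, uniform_split, accept_all, damped_sum
)
```

In this toy graph, pages reachable from the trusted seed accumulate goodness, while the page that links to the spam seed accumulates badness. Swapping in different splitting, accepting or combination functions changes the algorithm without touching the propagation loop, which is the kind of reuse a unified model is meant to allow.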
