DirichletRank: Ranking Web Pages Against Link Spams

Anti-spamming has become one of the most important challenges for web search engines and has attracted increasing attention in both industry and academia. Since most search engines now use link-based ranking algorithms, link-based spamming has become a major threat. In this paper, we show that the popular link-based ranking algorithm PageRank, while successfully used in the Google search engine, has a “zero-one gap” flaw that can be exploited to spam PageRank results easily. The “zero-one gap” problem arises from the current ad hoc way of computing the transition probabilities in the random surfing model. We propose a novel algorithm, DirichletRank, which computes these probabilities in a more principled way, using Bayesian estimation with a Dirichlet prior. DirichletRank is a variant of PageRank, but it does not have the “zero-one gap” problem and is analytically shown to be substantially more resistant to link farm spam than PageRank. Simulation experiments using real web data show that, compared with the original PageRank, DirichletRank is significantly more robust against several typical forms of link spam and is, in general, more stable under link perturbations. Moreover, experimental results show that DirichletRank is more effective than PageRank due to its more reasonable allocation of transition probabilities. Since DirichletRank can be computed as efficiently as PageRank, it scales to large web applications.
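The abstract does not state the transition formula, so the following is only an illustrative sketch under stated assumptions. It assumes that the Dirichlet-smoothed random-jump probability for a page with n out-links takes the form mu / (n + mu), by analogy with Dirichlet-prior smoothing in language modeling, and it contrasts this with the usual PageRank convention, in which a page with no out-links jumps with probability 1 and every other page jumps with probability 1 - d. The function names, the damping value 0.85, and the prior weight mu = 20 are hypothetical choices made for illustration, not values taken from the paper.

```python
def jump_probability_pagerank(n_out, damping=0.85):
    """Random-jump probability under the usual PageRank convention.

    A page with no out-links always jumps (probability 1); any page with
    at least one out-link jumps with probability 1 - damping. The drop
    from 1 to 1 - damping between n_out = 0 and n_out = 1 is the
    "zero-one gap".
    """
    return 1.0 if n_out == 0 else 1.0 - damping


def jump_probability_dirichlet(n_out, mu=20.0):
    """ASSUMED Dirichlet-smoothed jump probability: mu / (n_out + mu).

    It equals 1 for a page with no out-links and decreases smoothly as
    the number of out-links grows, so there is no abrupt gap.
    """
    return mu / (n_out + mu)


def transition_row(out_links, n_pages, mu=20.0):
    """Build one row of the transition matrix under the assumed smoothing.

    out_links is a list of distinct target page indices.
    """
    n_out = len(out_links)
    jump = jump_probability_dirichlet(n_out, mu)
    # Random-jump mass is spread uniformly over all pages.
    row = [jump / n_pages] * n_pages
    # The remaining mass is split evenly among the actual out-links.
    for target in out_links:
        row[target] += (1.0 - jump) / n_out
    return row


def rank(graph, mu=20.0, iterations=100):
    """Power iteration over the smoothed transition matrix.

    graph[i] is the list of pages that page i links to.
    """
    n_pages = len(graph)
    rows = [transition_row(graph[i], n_pages, mu) for i in range(n_pages)]
    scores = [1.0 / n_pages] * n_pages
    for _ in range(iterations):
        scores = [
            sum(scores[i] * rows[i][j] for i in range(n_pages))
            for j in range(n_pages)
        ]
    return scores


if __name__ == "__main__":
    # Compare the two jump probabilities for small out-link counts.
    for n_out in range(4):
        print(n_out,
              jump_probability_pagerank(n_out),
              jump_probability_dirichlet(n_out))
    # Toy graph: page 0 links to 1 and 2, page 1 links to 2, page 2 has no links.
    print(rank([[1, 2], [2], []]))
```

Under these assumptions, adding a single out-link to a page changes its jump probability only slightly, whereas under the PageRank convention the same change moves it from 1 to 1 - d in one step.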
