On characterizing and computing the diversity of hyperlinks for anti-spamming page ranking

With the advent of the big data era, efficiently and effectively querying useful information on the Web, the largest heterogeneous data source in the world, is becoming increasingly challenging. Page ranking is an essential component of search engines because it determines the order in which the tens of millions of pages returned for a single query are presented. It therefore plays a significant role in determining search quality and user experience in information retrieval. When measuring the authority of a web page, most methods focus on the quantity and quality of the neighboring pages that point to it through inbound hyperlinks. However, these methods ignore the diversity of such neighboring pages, which we believe is an important metric for objectively evaluating web page authority. True authority pages usually receive a large number of inbound hyperlinks from a wide variety of sources. In contrast, fake authorities, which boost their page rank using techniques such as link farms, find it difficult to attain a high diversity of inbound hyperlinks because the cost of doing so is prohibitively high. We propose a probabilistic counting-based method to quantitatively and efficiently compute the diversity of inbound hyperlinks. We then propose a novel link-based ranking algorithm, named Drank, that ranks pages by simultaneously analyzing the quantity, quality and diversity of their inbound hyperlinks. Validation on both synthetic and real-world data shows that Drank outperforms other state-of-the-art methods both in finding high-quality pages and in suppressing web spam.
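As a rough illustration of what a probabilistic counting-based diversity estimate could look like, the Python sketch below uses linear counting, a bitmap-based distinct-count estimator, to approximate how many distinct hosts link to a page. It is not the Drank algorithm, nor the estimator defined in this work. Treating the number of distinct source hosts as a proxy for diversity is an assumption made here, and the names `linear_count`, `inbound_host_diversity` and `num_bits` are hypothetical.

```python
import hashlib
import math
from urllib.parse import urlparse


def linear_count(items, num_bits=1024):
    """Estimate the number of distinct strings in `items` by linear counting.

    Each item is hashed to one slot of a fixed-size bitmap. The fraction of
    slots that stay empty is then inverted to estimate the distinct count.
    """
    bitmap = [False] * num_bits
    for item in items:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[slot] = True

    empty = bitmap.count(False)
    if empty == 0:
        # Saturated bitmap: the estimator is undefined, so return a lower bound
        # (every set slot needs at least one distinct item).
        return float(num_bits)
    return -num_bits * math.log(empty / num_bits)


def inbound_host_diversity(inbound_urls, num_bits=1024):
    """Assumed diversity proxy: estimated number of distinct linking hosts."""
    hosts = (urlparse(url).netloc.lower() for url in inbound_urls)
    return linear_count(hosts, num_bits)


# A link farm: many inbound links, all from one host.
farm_links = [f"http://spam.example/page{i}" for i in range(500)]
# An organically linked page: fewer links, but from many hosts.
organic_links = [f"http://site{i}.example/post" for i in range(200)]

farm_score = inbound_host_diversity(farm_links)
organic_score = inbound_host_diversity(organic_links)
```

In this toy setting the farm has more inbound links but an estimated diversity close to 1, while the organically linked page scores close to its 200 distinct hosts. The bitmap uses a fixed amount of memory per page regardless of how many inbound links are scanned, which is the usual reason for preferring a probabilistic counter over an exact set at Web scale.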
