SpamRank -- Fully Automatic Link Spam Detection

Spammers try to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method, based on the concept of personalized PageRank, that detects pages with an undeservedly high PageRank value without relying on any kind of whitelist, blacklist, or other form of human intervention. We assume that, for spammed pages, the pages contributing to the undeservedly high PageRank value have a biased distribution. We define SpamRank by penalizing pages that originate a suspicious PageRank share and then personalizing PageRank on the penalties. We test our method on a 31-million-page crawl of the .de domain, using a manually classified 1000-page stratified random sample biased towards large PageRank values.
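To make the last step concrete, the following is a minimal sketch of "personalizing PageRank on the penalties", written as a plain power iteration. It is not the authors' implementation: the abstract does not say how penalties are computed, so the penalty values, the toy graph, the function name `personalized_pagerank`, and the damping factor of 0.85 are all assumptions chosen for illustration.

```python
def personalized_pagerank(out_links, personalization, damping=0.85, iterations=100):
    """Power iteration for PageRank with a non-uniform teleport vector.

    out_links: dict mapping every page to the list of pages it links to
               (every link target must also appear as a key).
    personalization: dict of non-negative weights; pages not listed get 0.
    """
    nodes = list(out_links)
    total = sum(personalization.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization must contain a positive weight")
    # Normalize the weights into a probability distribution over pages.
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        # Teleport step: jump to a page chosen by the personalization vector.
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        dangling = 0.0
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Pages without out-links return their mass via teleport.
                dangling += damping * rank[n]
        for n in nodes:
            nxt[n] += dangling * teleport[n]
        rank = nxt
    return rank


# Toy graph (hypothetical): three farm pages boost one target page,
# while two honest pages only link to each other.
graph = {
    "farm1": ["target"],
    "farm2": ["target"],
    "farm3": ["target"],
    "target": ["farm1"],
    "honest_a": ["honest_b"],
    "honest_b": ["honest_a"],
}

# Hypothetical penalties: assume an earlier step flagged the farm pages
# as originating a suspicious PageRank share.
penalties = {"farm1": 1.0, "farm2": 1.0, "farm3": 1.0}

scores = personalized_pagerank(graph, penalties)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.4f}")
```

In this sketch the score can only flow along links that start at penalized pages, so the boosted target accumulates a high value while the honest pages, which are unreachable from the farm, stay at zero.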
