Extracting link spam using biased random walks from spam seed sets

Link spam is the deliberate manipulation of hyperlinks between web pages to unduly boost the search engine ranking of one or more target pages. Link-based ranking algorithms such as PageRank, HITS, and their derivatives are especially vulnerable to it. Link farms and link exchanges are two common forms of link spam, and both produce spam communities, i.e., clusters in the web graph. In this paper, we present a directed approach to extracting a link spam community given one or more of its members. In contrast to previous, fully automated approaches to finding link spam, our method is specifically designed to be used interactively. It starts from a small spam seed set provided by the user and simulates a random walk on the web graph. Decay probabilities bias the walk toward the local neighborhood of the seed set, and truncation retains only the most frequently visited nodes. After the walk terminates, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12% precision in extracting large link farms and 80.46% precision in extracting link exchange centroids.
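
The procedure described above (a seed-biased walk with decay, truncation, and a final ranking) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the decay probabilities or the final probabilities precisely, so this sketch assumes a single decay factor applied to the mass forwarded at each step and ranks nodes by their accumulated visit mass. The function name `biased_walk` and the parameters `steps`, `decay`, and `keep` are hypothetical.

```python
from collections import defaultdict


def biased_walk(graph, seeds, steps=10, decay=0.85, keep=100):
    """Rank nodes near `seeds` by a decayed, truncated random walk.

    graph: dict mapping each node to a list of its out-neighbors.
    decay: assumed factor (0..1) applied to the mass forwarded per step,
           which keeps the walk close to the seed set.
    keep:  assumed truncation size; only the `keep` heaviest nodes
           survive each step.
    """
    # Start with the probability mass spread uniformly over the seeds.
    current = {s: 1.0 / len(seeds) for s in seeds}
    score = defaultdict(float)
    for node, p in current.items():
        score[node] += p

    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in current.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue  # mass at a dangling node is dropped in this sketch
            share = decay * p / len(neighbors)
            for n in neighbors:
                nxt[n] += share
        # Truncation: keep only the heaviest nodes (ties broken by name).
        top = sorted(nxt.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]
        current = dict(top)
        for node, p in current.items():
            score[node] += p

    # Present nodes in decreasing order of accumulated mass.
    return sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Toy graph: a, b, c link densely to each other; x and y are outsiders.
    toy = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
           "x": ["y"], "y": []}
    for node, mass in biased_walk(toy, ["a"], steps=2, keep=3):
        print(node, round(mass, 4))
```

In an interactive setting, a user would inspect the top of the returned list, label nodes as spam or not, and possibly rerun the walk with an enlarged seed set. Smaller values of `decay` and `keep` keep the exploration more local, at the cost of possibly missing distant members of the community.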
