Paper information - Extracting link spam using biased random walks from spam seed sets

Extracting link spam using biased random walks from spam seed sets

Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link-based ranking algorithms such as PageRank, HITS, and their derivatives are especially vulnerable to link spam. Link farms and link exchanges are two common instances of link spam that produce spam communities, i.e., clusters in the web graph. In this paper, we present a directed approach to extracting link spam communities when given one or more members of the community. In contrast to previous, completely automated approaches to finding link spam, our method is specifically designed to be used interactively. Our approach starts with a small spam seed set provided by the user and simulates a random walk on the web graph. The random walk is biased to explore the local neighborhood around the seed set through the use of decay probabilities. Truncation is used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12% precision in extracting large link farms and 80.46% precision in extracting link exchange centroids.
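The pipeline described in the abstract (seed set, biased walk with decay, truncation, final ranking) can be illustrated with a minimal Python sketch. The abstract does not give the exact update rule, decay values, or truncation threshold, so everything here is an assumption: the function name `extract_community`, the parameters `decay`, `steps`, and `keep`, and the choice of a personalized-PageRank-style update in which a fraction `1 - decay` of each node's probability returns to the seeds at every step. It should be read as one plausible instance of the idea, not as the authors' algorithm.

```python
from collections import defaultdict


def extract_community(graph, seeds, decay=0.85, steps=20, keep=100):
    """Rank nodes near `seeds` with a truncated, seed-biased random walk.

    graph: dict mapping each node to a list of its out-neighbors.
    seeds: known spam nodes supplied by the user.
    decay, steps, keep: hypothetical parameters (not from the paper).
    Returns a list of (node, probability) sorted by decreasing probability.
    """
    seeds = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    if not seeds:
        raise ValueError("at least one seed is required")

    seed_share = 1.0 / len(seeds)
    probs = {s: seed_share for s in seeds}

    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in probs.items():
            out = graph.get(node, [])
            if out:
                # Follow an out-link with probability `decay`.
                share = decay * p / len(out)
                for nb in out:
                    nxt[nb] += share
                returned = (1.0 - decay) * p
            else:
                # Assumption: a dangling node sends all its mass back.
                returned = p
            # The bias: the remaining mass jumps back to the seed set.
            for s in seeds:
                nxt[s] += returned * seed_share

        # Truncation: retain only the `keep` most probable nodes,
        # then renormalize so the values still sum to 1.
        top = sorted(nxt.items(), key=lambda kv: -kv[1])[:keep]
        total = sum(p for _, p in top)
        probs = {node: p / total for node, p in top}

    return sorted(probs.items(), key=lambda kv: -kv[1])


# Toy graph: a, b, c form a dense cluster; x, y, z trail off from it.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "x"],
    "x": ["y"],
    "y": ["z"],
    "z": [],
}
ranked = extract_community(toy_graph, ["a"], keep=3)
```

In this toy example the walk starts at the single seed `a`, and truncation to three nodes leaves only the densely linked cluster around the seed; in an interactive setting the ranked list would be shown to the user for review.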

Baoning Wu | Kumar Chellapilla

[1] Rashmi Raj, et al. Web Spam Detection with Anti-Trust Rank, 2006, AIRWeb.

[2] Hector Garcia-Molina, et al. Web Spam Taxonomy, 2005, AIRWeb.

[3] Kevin J. Lang, et al. Communities from seed sets, 2006, WWW '06.

[4] Shang-Hua Teng, et al. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, 2003, STOC '04.

[5] Brian D. Davison. Recognizing Nepotistic Links on the Web, 2000.

[6] Ravi Kumar, et al. Trawling the Web for Emerging Cyber-Communities, 1999, Comput. Networks.

[7] Piotr Indyk, et al. Enhanced hypertext categorization using hyperlinks, 1998, SIGMOD '98.

[8] Eli Upfal, et al. Web search using automatic classification, 1996, WWW 1996.

[9] Kevin S. McCurley, et al. Ranking the web frontier, 2004, WWW '04.

[10] Brian D. Davison, et al. Identifying link farm pages, 2005, WWW 2005.

[11] Brian D. Davison, et al. Identifying link farm spam pages, 2005, WWW '05.