Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers who find the site as a highly ranked search engine result. Link spamming, the practice of inflating the page rank of a target page by artificially creating many referring pages, has therefore become common. To maintain the quality of their results, search engine providers oppose efforts that decorrelate page rank from relevance and maintain blacklists of spamming pages; spammers, in turn, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.
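
To make the inflation mechanism concrete, the following is a minimal sketch, not the method of this paper. It assumes the textbook power-iteration form of PageRank with a damping factor of 0.85; the toy graph, the page names, and the 20-page link farm are all hypothetical. It compares the score of a target page before and after a set of artificial referring pages is added.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Assumption: standard formulation with uniform teleportation; dangling
    pages (no out-links) spread their rank uniformly over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a hub linking to three pages that link back.
honest = {
    "hub": ["a", "b", "target"],
    "a": ["hub"],
    "b": ["hub"],
    "target": ["hub"],
}

# Hypothetical link farm: 20 artificial pages that only link to the target.
farm = {f"farm{i}": ["target"] for i in range(20)}
spammed = {**honest, **farm}

before = pagerank(honest)["target"]
after = pagerank(spammed)["target"]
print(f"target score without farm: {before:.4f}")
print(f"target score with farm:    {after:.4f}")
```

In the unspammed graph the target scores the same as its structurally identical peers `a` and `b`; once the farm is added, the target's score rises above theirs even though its relevance is unchanged, which is the decorrelation of page rank and relevance described above.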
