Scalable Anti-TrustRank with Qualified Site-level Seeds for Link-based Web Spam Detection

Web spam detection is one of the most important and challenging tasks in web search. Because web spam pages tend to have many spurious links, many web spam detection algorithms exploit the hyperlink structure between web pages to detect them. In this paper, we conduct a comprehensive analysis of the link structure of web spam using real-world web graphs to systematically investigate the characteristics of link-based web spam. By exploring the structure of the page-level graph as well as the site-level graph, we propose a scalable site-level seeding methodology for the Anti-TrustRank (ATR) algorithm. The key idea is to map each website into a feature space in which we learn a classifier that prioritizes websites, so that we can effectively select a set of good seeds for the ATR algorithm. With this seeding method, the ATR algorithm detects more spam pages than the competitive baseline methods. Furthermore, we design work-efficient asynchronous ATR algorithms that significantly reduce the computational cost of the traditional ATR algorithm while guaranteeing convergence and without degrading spam detection performance.
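For readers unfamiliar with ATR, the basic idea can be sketched as a personalized PageRank run on the reversed web graph: distrust starts at known spam seeds and flows backward along hyperlinks, so pages that link to spam accumulate distrust. The following is only a minimal synchronous sketch under stated assumptions; it is not the asynchronous algorithm or the site-level seeding method proposed in the paper. The function name `anti_trust_rank`, the damping value `alpha=0.85`, the fixed iteration count, and the handling of pages without in-links are all illustrative choices.

```python
from collections import defaultdict


def anti_trust_rank(edges, spam_seeds, alpha=0.85, iters=50):
    """Illustrative power-iteration sketch of Anti-TrustRank.

    edges: iterable of (src, dst) hyperlinks, meaning src links to dst.
    spam_seeds: set of nodes assumed to be known spam.
    Returns a dict mapping each node to its distrust score.
    """
    nodes = set(spam_seeds)
    in_nbrs = defaultdict(list)  # in_nbrs[v] = pages that link to v
    for src, dst in edges:
        nodes.update((src, dst))
        in_nbrs[dst].append(src)

    # Seed vector: uniform over the spam seeds, zero elsewhere.
    seed = {
        v: (1.0 / len(spam_seeds) if v in spam_seeds else 0.0) for v in nodes
    }
    score = dict(seed)

    for _ in range(iters):
        # Restart mass goes back to the seeds.
        nxt = {v: (1.0 - alpha) * seed[v] for v in nodes}
        for v in nodes:
            srcs = in_nbrs.get(v)
            if not srcs:
                # Assumption: distrust at a page with no in-links is dropped.
                continue
            # Distrust flows backward: v passes its score to pages linking to v.
            share = alpha * score[v] / len(srcs)
            for u in srcs:
                nxt[u] += share
        score = nxt
    return score


if __name__ == "__main__":
    toy_edges = [("a", "spam"), ("b", "a"), ("c", "d")]
    scores = anti_trust_rank(toy_edges, {"spam"})
    # Print nodes from most to least distrusted.
    for node in sorted(scores, key=scores.get, reverse=True):
        print(node, scores[node])
```

In the toy graph, `a` links directly to the seed and `b` links to `a`, so both receive distrust, while `c` and `d` are disconnected from the seed and stay at zero. Ranking pages by this score and inspecting the top of the list is the usual way such scores are used; the quality of the result depends heavily on which seeds are chosen, which is the problem the seeding methodology above addresses.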
