Link-based web spam detection using weight properties

Link spam is created to boost a target's rank in return for business profit. This unethical way of deceiving Web search engines is known as Web spam, and many link spam detection techniques have been proposed to counter it. Web spam detection is a crucial task because of the damage spam does to Web search engines and its global cost of billions of dollars annually. In this paper, we propose a novel technique that incorporates weight properties to enhance Web spam detection algorithms. A weight property can be defined as the influence of one Web node on another Web node. We modified existing Web spam detection algorithms with our technique and evaluated their performance on a large public Web spam dataset, WEBSPAM-UK2007. Overall, the modified algorithms outperform the benchmark algorithms by up to 30.5% at the host level and 6.11% at the page level.
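As an illustration of how a weight property could enter a link-based algorithm, the sketch below propagates trust from a seed set in the style of TrustRank, splitting each node's score among its out-links in proportion to an edge weight rather than uniformly. The abstract does not say how the weights are computed, so the weights here are assumed to be given, and the function name `weighted_trust_propagation` and its parameters are hypothetical, not the paper's formulation.

```python
from collections import defaultdict


def weighted_trust_propagation(edges, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed nodes along weighted links.

    edges: dict mapping (source, target) -> non-negative weight.
           The weight stands in for the "influence" of source on target;
           how it is derived is an assumption left to the caller.
    seeds: collection of trusted node identifiers.
    """
    nodes = set(seeds)
    out_weight = defaultdict(float)
    for (src, dst), w in edges.items():
        nodes.update((src, dst))
        out_weight[src] += w

    # Trust is injected only at the seed nodes, uniformly.
    seed_score = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(seed_score)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * seed_score[n] for n in nodes}
        for (src, dst), w in edges.items():
            if out_weight[src] > 0:
                # Each out-link receives a share proportional to its weight.
                nxt[dst] += damping * score[src] * w / out_weight[src]
        score = nxt
    return score


# Toy graph: the trusted host links to "a" three times as strongly as to "b".
scores = weighted_trust_propagation(
    {("good", "a"): 3.0, ("good", "b"): 1.0}, seeds={"good"}
)
```

With uniform weights this reduces to an unweighted seed-biased propagation; with the toy weights above, node `a` ends up with three times the trust of node `b`.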
