Deeply Exploiting Link Structure: Setting a Tougher Life for Spammers

Previous link-based anti-spam algorithms suffer from either a weak page value metric or vague seed selection. In this paper, we propose two page value metrics, AVRank and HVRank. Both “values” can be assessed for every web page using information from bidirectional links. Bidirectional links also make it easier to enlarge the propagation coverage and to reduce the bias of seed sets. We further discuss the effectiveness of combining the two metrics, for example by taking their quadratic mean. Our experimental results show that, with these two metrics, a large, automatically selected seed set achieves better propagation coverage and less biased ranking results. Most importantly, our method filters out spam sites and identifies reputable sites more effectively.
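
The quadratic mean mentioned above as one way of combining the two metrics can be illustrated with a minimal sketch. The abstract does not define how AVRank and HVRank are computed, so the scores here are treated as given inputs; the function names `quadratic_mean` and `combine_scores` and the dictionary layout are assumptions made for illustration, not the authors' implementation.

```python
import math


def quadratic_mean(a: float, b: float) -> float:
    """Quadratic mean (root mean square) of two scores."""
    return math.sqrt((a * a + b * b) / 2.0)


def combine_scores(av_scores: dict, hv_scores: dict) -> dict:
    """Combine two per-page score maps with the quadratic mean.

    Hypothetical layout: each dict maps a page identifier to a score.
    Only pages present in both maps are combined.
    """
    shared_pages = av_scores.keys() & hv_scores.keys()
    return {
        page: quadratic_mean(av_scores[page], hv_scores[page])
        for page in shared_pages
    }
```

Compared with the arithmetic mean, the quadratic mean gives more weight to the larger of the two scores, so a page that scores highly on only one metric still receives a relatively high combined value.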
