Automatic seed set expansion for trust propagation based anti-spamming algorithms

Seed sets are of significant importance for trust-propagation-based anti-spamming algorithms such as TrustRank. Conventional approaches construct the seed set through manual evaluation, which keeps it small: building a very large seed set by hand is too costly and may even be impossible. A small seed set can have a detrimental effect on the final ranking results, so it is desirable to expand an initial seed set automatically into a much larger one. In this paper, we propose the first automatic seed set expansion algorithm (ASE), which expands a small seed set by selecting reputable seeds that are found, and guaranteed to be reputable, through a joint recommendation link structure. Experimental results on the WEBSPAM-2007 dataset show that, with the same manual evaluation effort, ASE automatically obtains a large number of reputable seeds with high precision, and thereby significantly improves the performance of the baseline algorithm in terms of both reputable site promotion and spam site demotion.
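
The abstract does not spell out what the joint recommendation link structure looks like. As a rough illustration only, the sketch below assumes one possible reading: a candidate site joins the seed set when at least `min_recommenders` distinct existing seeds link to it. The function name `expand_seeds`, the `out_links` adjacency mapping, and the threshold parameter are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def expand_seeds(out_links, seeds, min_recommenders=2):
    """One expansion pass under an ASSUMED rule (not the paper's definition):
    add every non-seed site that at least `min_recommenders` distinct seeds link to.

    out_links: dict mapping a site to an iterable of sites it links to.
    seeds:     iterable of manually evaluated reputable sites.
    """
    seeds = set(seeds)
    recommenders = defaultdict(set)  # candidate site -> seeds that link to it

    for seed in seeds:
        for target in out_links.get(seed, ()):
            if target not in seeds:
                recommenders[target].add(seed)

    jointly_recommended = {
        site for site, recs in recommenders.items()
        if len(recs) >= min_recommenders
    }
    return seeds | jointly_recommended


# Toy graph: "x" and "y" are each linked by two seeds, "z" by only one.
toy_links = {"a": ["x", "y"], "b": ["x"], "c": ["y", "z"], "x": ["w"]}
expanded = expand_seeds(toy_links, {"a", "b", "c"})
```

In this toy graph the pass adds `x` and `y` but not `z`, since a single recommendation is not treated as joint under the assumed rule. The expanded set could then replace the original seeds in a trust propagation algorithm such as TrustRank.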
