Automatic seed set expansion for trust propagation based anti-spam algorithms

Seed sets are critical to trust propagation based anti-spam algorithms such as TrustRank. Conventional approaches construct the seed set by manual evaluation, which keeps it small, because building a very large seed set by hand is too costly and may even be impossible. A small seed set, in turn, degrades the final ranking results. It is therefore desirable to expand an initial seed set automatically. In this paper, we propose an automatic seed set expansion algorithm (ASE) that enriches a small seed set into a much larger one. The intuition behind ASE is that if a page is recommended by a number of trustworthy pages, the page itself should be trustworthy as well. Since links on the Web can be regarded as a means of conveying recommendation, we call a set of links recommending the same page a joint recommendation link structure. ASE uses the joint recommendation link structures with sufficiently large support degrees to obtain new seeds. It can be proved that, with a suitable support degree, the probability of selecting a spam page as a new seed is almost zero, so the quality of the expanded seed set can be guaranteed. Experimental results on the WEBSPAM-UK2007 dataset show that, with the same manual evaluation effort, ASE automatically obtains a large number of high-quality reputable seeds and significantly improves the performance of trust propagation based algorithms such as TrustRank and CPV (Computing Page Values).
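The expansion idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the support degree of a page is simply the number of distinct seed pages linking to it, and that expansion is a single pass over the seeds' out-links. The names `expand_seed_set`, `out_links`, and `min_support`, as well as the toy graph, are hypothetical.

```python
from collections import Counter


def expand_seed_set(out_links, seeds, min_support):
    """Return non-seed pages linked to by at least `min_support` distinct seeds.

    out_links   -- dict mapping a page to an iterable of pages it links to
    seeds       -- set of manually evaluated trustworthy pages
    min_support -- assumed support-degree threshold (number of recommending seeds)
    """
    support = Counter()
    for seed in seeds:
        # set() ensures repeated links from one seed count as one recommendation
        for target in set(out_links.get(seed, ())):
            if target not in seeds:
                support[target] += 1
    return {page for page, count in support.items() if count >= min_support}


# Toy graph (hypothetical): three trusted seeds and one untrusted page.
out_links = {
    "a": ["x", "y"],
    "b": ["x", "y"],
    "c": ["x", "z"],
    "spam": ["z"],
}
seeds = {"a", "b", "c"}

new_seeds = expand_seed_set(out_links, seeds, min_support=2)
expanded = seeds | new_seeds
print(sorted(new_seeds))
```

In this toy graph, `x` is recommended by three seeds and `y` by two, so both pass a threshold of 2, while `z` has only one seed recommender and is left out. Raising `min_support` trades fewer new seeds for a lower chance of admitting a spam page, which is the trade-off the abstract attributes to choosing a suitable support degree.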
