Propagating Trust and Distrust to Demote Web Spam

Web spamming is behavior that attempts to deceive search engines' ranking algorithms. TrustRank is a recent algorithm that combats web spam by propagating trust among web pages. However, TrustRank splits the trust a page passes on according to its number of outgoing links, which is also how PageRank propagates authority scores among web pages. This type of propagation may suit authority, but it is not optimal for calculating trust scores intended to demote spam sites. In this paper, we propose several alternative methods for propagating trust on the web. Experiments on a real web data set show that these methods greatly decrease the number of spam sites in the top portion of the trust ranking. In addition, we investigate propagating distrust among web pages. Experiments show that combining trust and distrust values demotes more spam sites than using trust values alone.
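To make the baseline concrete, the following is a minimal sketch of out-degree-based propagation in the style of a seed-biased PageRank, which is how TrustRank is commonly described. It is not the paper's implementation. The function name `propagate`, the toy graph, the decay factor `alpha`, the weight `beta`, the choice to send distrust against link direction, and the subtraction used to combine the two scores are all assumptions made for illustration; the alternative splitting methods proposed in the paper are not reproduced here.

```python
def propagate(graph, seeds, alpha=0.85, iterations=50, reverse=False):
    """Seed-biased propagation where each page splits its score
    equally among the pages it passes score to.

    graph:   dict mapping a page to the list of pages it links to
    seeds:   pages that receive the static score on every iteration
    reverse: if True, score flows from a page to the pages linking to it
             (an assumption used here for distrust)
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Adjacency actually used for propagation (forward or reversed links).
    out = {n: [] for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            if reverse:
                out[dst].append(src)
            else:
                out[src].append(dst)

    static = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(static)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * static[n] for n in nodes}
        for n in nodes:
            if out[n]:
                # Equal split by number of outgoing edges.
                share = alpha * score[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
        score = nxt
    return score


# Hypothetical toy graph: one trusted seed, one known spam page.
graph = {
    "good": ["a", "b"],
    "a": ["b"],
    "b": ["spam"],
    "spam": ["farm"],
    "farm": ["spam"],
}

trust = propagate(graph, seeds={"good"})
distrust = propagate(graph, seeds={"spam"}, reverse=True)

beta = 1.0  # assumed weight on distrust
combined = {n: trust[n] - beta * distrust[n] for n in trust}
```

In this toy graph, `b` links to the spam page, so the equal split hands trust to `spam` just as it would to any other child. Subtracting a distrust score is one simple way to lower such a page's combined value.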
