Topical TrustRank: using topicality to combat web spam

Web spam is behavior that attempts to deceive search engine ranking algorithms. TrustRank is a recent algorithm for combating web spam, but it has two weaknesses. First, its seed set may not be representative enough to cover the different topics on the Web. Second, for a given seed set, TrustRank is biased towards larger communities. To address both issues, we propose using topical information to partition the seed set and calculating trust scores for each topic separately. A combination of these trust scores determines a page's ranking. Experimental results on two large datasets show that our Topical TrustRank performs better than TrustRank at demoting spam sites or pages. Compared to TrustRank, our best technique can decrease spam among the top-ranked sites by as much as 43.1%.
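
The idea of partitioning the seed set by topic and combining per-topic trust scores can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy graph, the damping factor, and the choice of plain summation as the combination are all assumptions made for illustration, and the sketch assumes every link target also appears as a key of the graph.

```python
def biased_pagerank(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from `seeds` over a link graph (TrustRank-style sketch).

    `out_links` maps each page to the list of pages it links to.
    Assumption: every link target is itself a key of `out_links`.
    """
    nodes = list(out_links)
    # Jump vector: uniform over the seed pages, zero everywhere else.
    jump = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * jump[n] for n in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue  # dangling page: its trust is not propagated further
            share = damping * score[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        score = nxt
    return score


def topical_trust(out_links, seeds_by_topic, **kwargs):
    """Compute one trust vector per topic and combine them.

    Assumption: the combination is a plain sum; other combinations
    are possible and this sketch does not claim to match the paper.
    """
    total = {n: 0.0 for n in out_links}
    for seeds in seeds_by_topic.values():
        topic_score = biased_pagerank(out_links, seeds, **kwargs)
        for node, value in topic_score.items():
            total[node] += value
    return total


# Hypothetical toy graph: two small trusted communities and a spam pair
# that no trusted page links to.
graph = {
    "a": ["b"], "b": ["a"],
    "c": ["d"], "d": ["c"],
    "spam": ["spam2"], "spam2": ["spam"],
}
seeds_by_topic = {"arts": {"a"}, "science": {"c"}}
trust = topical_trust(graph, seeds_by_topic)
```

In this toy setup each topic contributes its own trust vector, so a small topic with few seeds is not drowned out by a larger one, and pages unreachable from any seed receive no trust.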
