Fighting against web spam: a novel propagation method based on click-through data

Combating Web spam is one of the greatest challenges for Web search engines. State-of-the-art anti-spam techniques focus mainly on detecting specific spam strategies, such as content spamming and link-based spamming. Although these approaches have had much success, they struggle against a continuous barrage of new spamming techniques. We approach the problem from a new perspective, observing that queries that are more likely to lead to spam pages/sites share two characteristics: 1) they are popular or reflect heavy demand from search engine users, and 2) they usually have few key resources or authoritative results. Based on these observations, we propose a novel method based on click-through data analysis that propagates a spamicity score iteratively between queries and URLs, starting from a few seed pages/sites. Once we obtain the seed pages/sites, we use the link structure of the click-through bipartite graph to discover other pages/sites that are likely to be spam. Experiments show that our algorithm is both efficient and effective in detecting Web spam. Moreover, combining our method with popular anti-spam techniques such as TrustRank yields an improvement over each technique used individually.
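
The abstract describes the propagation only at a high level, so the following is a minimal sketch of one way a spamicity score could be passed back and forth across a click-through bipartite graph, not the update rule used in the paper. The function name `propagate_spamicity`, the click-weighted averaging, the `damping` factor, the clamping of seeds to 1.0, and the fixed iteration count are all assumptions made for illustration.

```python
from collections import defaultdict


def propagate_spamicity(clicks, seed_urls, iterations=10, damping=0.85):
    """Illustrative score propagation on a query-URL click graph.

    clicks: dict mapping (query, url) -> positive click count.
    seed_urls: set of URLs already labeled as spam.
    Returns (query_scores, url_scores), each a dict of floats in [0, 1].
    The update rule is an assumption, not the paper's formula.
    """
    # Build both directions of the bipartite graph, weighted by clicks.
    query_to_urls = defaultdict(dict)
    url_to_queries = defaultdict(dict)
    for (query, url), count in clicks.items():
        if count > 0:  # ignore edges with no clicks
            query_to_urls[query][url] = count
            url_to_queries[url][query] = count

    # Seeds start at 1.0; everything else starts at 0.0.
    url_scores = {u: (1.0 if u in seed_urls else 0.0) for u in url_to_queries}
    query_scores = {q: 0.0 for q in query_to_urls}

    for _ in range(iterations):
        # A query's score is the click-weighted mean of its URLs' scores.
        for query, urls in query_to_urls.items():
            total = sum(urls.values())
            query_scores[query] = sum(
                count * url_scores[url] for url, count in urls.items()
            ) / total

        # A URL's score is the damped click-weighted mean of its queries'
        # scores; seed URLs stay clamped at 1.0.
        for url, queries in url_to_queries.items():
            if url in seed_urls:
                url_scores[url] = 1.0
                continue
            total = sum(queries.values())
            url_scores[url] = damping * sum(
                count * query_scores[query] for query, count in queries.items()
            ) / total

    return query_scores, url_scores


# Toy click log with hypothetical queries and hosts.
example_clicks = {
    ("cheap pills", "spam-a.example"): 10,
    ("cheap pills", "spam-b.example"): 10,
    ("python docs", "docs.example"): 20,
}
query_scores, url_scores = propagate_spamicity(example_clicks, {"spam-a.example"})
```

In this toy log, `spam-b.example` shares a query with the seed and so accumulates a positive score, while `docs.example` is reached only through an unrelated query and stays at zero. A real system would also need a threshold or ranking step to turn scores into spam decisions, which this sketch leaves out.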
