Web spam detection via commercial intent analysis

We propose a number of features for Web spam filtering based on the occurrence of keywords that are either of high advertisement value or highly spammed. Our features include popular words from search engine query logs as well as words with high cost or volume according to Google AdWords. We also demonstrate the spam filtering power of three further signals: the Online Commercial Intention (OCI) value assigned to a URL by a Microsoft adCenter Labs demonstration, the Yahoo! Mindset classification of Web pages as either commercial or non-commercial, and metrics based on the occurrence of Google ads on the page. We run our tests on the WEBSPAM-UK2006 dataset, recently compiled by Castillo et al. as a standard means of measuring the performance of Web spam detection algorithms. Adding our features to the publicly available WEBSPAM-UK2006 features improves classification accuracy by 3%.
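As a rough illustration of what a keyword-occurrence feature might look like, the sketch below counts how often words from a given list appear in a page's text. It is not the paper's implementation: the function name `keyword_features`, the tokenization rule, and the three returned statistics are all assumptions made for this example, and the keyword set stands in for a list that would in practice come from query logs or advertising data.

```python
import re


def keyword_features(page_text, keywords):
    """Hypothetical keyword-occurrence features for one page.

    `keywords` is an assumed set of lowercase words (for example,
    high-cost advertising terms); the paper does not specify this format.
    """
    # Simple lowercase alphanumeric tokenization (an assumption).
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    if not tokens:
        return {"hits": 0, "fraction": 0.0, "distinct": 0}
    hits = [t for t in tokens if t in keywords]
    return {
        "hits": len(hits),                    # total keyword occurrences
        "fraction": len(hits) / len(tokens),  # share of tokens that are keywords
        "distinct": len(set(hits)),           # number of different keywords seen
    }


example = keyword_features(
    "Cheap loans and cheap insurance today",
    {"cheap", "loans", "insurance"},
)
```

Values such as these could be appended to an existing feature vector before training a classifier; which statistics are actually useful would have to be determined experimentally.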
