Fighting against web spam: a novel propagation method based on click-through data

Combating Web spam is one of the greatest challenges for Web search engines. State-of-the-art anti-spam techniques focus mainly on detecting particular spam strategies, such as content spamming and link-based spamming. Although these anti-spam approaches have had much success, they struggle against a continuous barrage of new spamming techniques. We attempt to solve the problem from a new perspective, by noticing that queries that are more likely to lead to spam pages/sites have the following characteristics: 1) they are popular or reflect heavy demand from search engine users and 2) there are usually few key resources or authoritative results for them. From these observations, we propose a novel method based on click-through data analysis that propagates a spamicity score iteratively between queries and URLs, starting from a few seed pages/sites. Once we obtain the seed pages/sites, we use the link structure of the click-through bipartite graph to discover other pages/sites that are likely to be spam. Experiments show that our algorithm is both efficient and effective in detecting Web spam. Moreover, combining our method with popular anti-spam techniques such as TrustRank improves on each technique taken individually.
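
The abstract does not state the update rule, so the following is only a minimal sketch of what iterative score propagation over a query-URL click-through bipartite graph could look like. The function name `propagate_spamicity`, the click-weighted averaging, the `damping` factor, and the clamping of seed URLs to a score of 1.0 are all illustrative assumptions, not the authors' published algorithm.

```python
from collections import defaultdict


def propagate_spamicity(clicks, seeds, iterations=10, damping=0.85):
    """Hypothetical propagation of a spamicity score on a click-through graph.

    clicks: iterable of (query, url, click_count) tuples, click_count > 0.
    seeds:  set of URLs already known to be spam (assumed to be given).
    Returns (url_score, query_score) dictionaries with values in [0, 1].
    """
    # Build both directions of the bipartite graph, weighted by click counts.
    q_to_u = defaultdict(dict)
    u_to_q = defaultdict(dict)
    for query, url, count in clicks:
        q_to_u[query][url] = q_to_u[query].get(url, 0) + count
        u_to_q[url][query] = u_to_q[url].get(query, 0) + count

    # Seed URLs start at 1.0, every other URL at 0.0.
    url_score = {u: (1.0 if u in seeds else 0.0) for u in u_to_q}
    query_score = {}

    for _ in range(iterations):
        # A query's score is the click-weighted average of its URLs' scores.
        query_score = {
            q: sum(w * url_score[u] for u, w in urls.items()) / sum(urls.values())
            for q, urls in q_to_u.items()
        }
        # A URL's score is the damped click-weighted average of its queries'
        # scores; seed URLs are clamped back to 1.0 (an assumption).
        updated = {}
        for u, queries in u_to_q.items():
            if u in seeds:
                updated[u] = 1.0
            else:
                total = sum(queries.values())
                avg = sum(w * query_score[q] for q, w in queries.items()) / total
                updated[u] = damping * avg
        url_score = updated

    return url_score, query_score
```

Under these assumptions, a URL that shares heavily clicked queries with a seed page ends up with a higher score than a URL reached mostly through unrelated queries, which is the intuition behind using the click-through graph to discover likely spam from a few seeds.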

Yiqun Liu | Min Zhang | Shaoping Ma | Chao Wei | Kuo Zhang | Liyun Ru
