Multi-View Learning for Web Spam Detection

Spam pages are designed to appear illegitimately among the top search results through excessive use of popular terms. An effective and efficient spam detection system is therefore needed to remove them. Previous methods for web spam classification drew features from several information sources (page contents, web graph, access logs, etc.). In this paper, we follow a page-level classification approach to build fast and scalable spam filters. We show that each web page can be classified with satisfactory accuracy using only its own HTML content. To design a multi-view classification system, we use state-of-the-art spam classification methods with distinct feature sets (views) as the base classifiers. A fusion model is then learned to combine the outputs of the base classifiers and make the final prediction. Results on our Persian web spam dataset show that multi-view learning significantly improves classification performance, raising AUC by 22%, while providing linear speedup for parallel execution.
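The two-stage design described above (one base classifier per view, then a learned fusion model over their outputs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two views, the synthetic data, the use of logistic regression for both stages, and all function names (`make_toy_pages`, `fit_multi_view`, and so on) are hypothetical stand-ins chosen for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict_proba(model, X):
    """Spam probability for each row of X under a (weights, bias) model."""
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def make_toy_pages(n, seed):
    """Synthetic pages with two hypothetical views (assumption: not real data)."""
    rng = random.Random(seed)
    view_a, view_b, labels = [], [], []
    for _ in range(n):
        spam = rng.random() < 0.5
        shift = 1.0 if spam else -1.0
        view_a.append([rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)])
        view_b.append([rng.gauss(shift, 1.0)])
        labels.append(1 if spam else 0)
    return [view_a, view_b], labels


def fit_multi_view(views, y, holdout=0.5):
    """Train one base classifier per view, then a fusion model on held-out scores."""
    cut = int(len(y) * (1 - holdout))
    base_models = [train_logreg(v[:cut], y[:cut]) for v in views]
    # Fusion input: one column per base classifier, scored on pages the
    # base classifiers did not see during training.
    scores = [predict_proba(m, v[cut:]) for m, v in zip(base_models, views)]
    fusion = train_logreg([list(row) for row in zip(*scores)], y[cut:])
    return base_models, fusion


def predict_multi_view(model, views):
    base_models, fusion = model
    # Each base classifier reads only its own view, so these calls are independent.
    scores = [predict_proba(m, v) for m, v in zip(base_models, views)]
    return predict_proba(fusion, [list(row) for row in zip(*scores)])


train_views, train_y = make_toy_pages(400, seed=0)
test_views, test_y = make_toy_pages(200, seed=1)
model = fit_multi_view(train_views, train_y)
probs = predict_multi_view(model, test_views)
accuracy = sum((p >= 0.5) == (t == 1) for p, t in zip(probs, test_y)) / len(test_y)
```

Because each base classifier scores a page from its own view alone, the per-view calls in `predict_multi_view` could be dispatched to separate workers, which is the property that makes parallel execution natural in this kind of design. Training the fusion model on held-out scores is one common choice for avoiding overfitting to the base classifiers' training data; the abstract does not state how the fusion model was actually trained.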
