Detecting spam webpages through topic and semantics analysis

Spam web pages pose a great challenge to the development of search engines, and content spam is among the most commonly used spamming techniques. As Internet technologies have developed, content spam has become difficult to detect. Current methods for detecting web pages that use content spam rely primarily on statistical features, which have obvious limitations. This article proposes a spam webpage detection method based on topic and semantics that uses two categories of features, namely semantic and statistical. Topic modeling is first performed over the contents of the webpage, mapping the contents into the topic space. Semantic analysis and calculation are then carried out in the topic space according to the distribution of topics. The extracted semantic features are combined with the statistical features to classify webpages. The results verify that the proposed method achieves better detection performance.
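The pipeline described above (topic distribution, then semantic features, then combination with statistical features) can be illustrated with a minimal sketch. The abstract does not name the concrete features, so everything here is an assumption made for illustration: the feature names (`topic_entropy`, `max_topic_prob`, `distinct_ratio`, `top_word_fraction`), the helper functions, and the premise that a document-topic distribution `theta` has already been produced by some topic model such as LDA. It is not the authors' implementation.

```python
import math
from collections import Counter


def topic_features(theta):
    """Hypothetical semantic features from a document-topic distribution.

    theta is assumed to be a list of topic probabilities summing to 1,
    produced beforehand by a topic model (for example LDA).
    """
    # Shannon entropy: low when the page concentrates on one topic,
    # high when probability mass is spread over many topics.
    entropy = -sum(p * math.log(p) for p in theta if p > 0)
    return {"topic_entropy": entropy, "max_topic_prob": max(theta)}


def statistical_features(text):
    """Hypothetical content statistics computed from whitespace tokens."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return {
        "word_count": n,
        # Share of distinct words; keyword stuffing tends to lower it.
        "distinct_ratio": len(counts) / n if n else 0.0,
        # Fraction of the text taken up by the single most frequent word.
        "top_word_fraction": counts.most_common(1)[0][1] / n if n else 0.0,
    }


def feature_vector(text, theta):
    """Merge both feature categories into one dict for a classifier."""
    feats = {}
    feats.update(statistical_features(text))
    feats.update(topic_features(theta))
    return feats
```

A call such as `feature_vector("cheap pills cheap pills cheap", [0.5, 0.5])` returns one dictionary holding both feature categories, which could then be passed to any standard classifier. Which classifier and which features work best is not stated in the abstract.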
