Web Spam Detection: New Classification Features Based on Qualified Link Analysis and Language Models

Web spam is a serious problem for search engines because spam pages can severely degrade the quality of their results. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features relate not only to quantitative data extracted from Web pages, but also to qualitative properties, mainly of the page links. We consider, for instance, whether a search engine can find the page that a link actually points to, using only the information that the linking page provides for that link. This can be regarded as an indicator of the link's reliability. We also check the coherence between a page and each page its links point to: two pages joined by a hyperlink should be semantically related, at least by a weak contextual relation. We therefore apply an LM approach to different sources of information from a Web page that belong to the context of a link, in order to provide high-quality indicators of Web spam. Specifically, we apply the Kullback-Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages. The result is a system that significantly improves the detection of Web spam while using fewer features, on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
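As an illustration of the LM-based idea, the following is a minimal sketch of a Kullback-Leibler divergence between unigram language models built from two pieces of text, for example text from the context of a link and text from the page it points to. The function names, the whitespace tokenization, and the additive smoothing are assumptions made for this example; the abstract does not specify how the language models are estimated or which sources of information are paired.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary.

    The smoothing constant alpha is an assumption of this sketch; it keeps
    every probability non-zero so the divergence stays finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text, alpha=1.0):
    """Divergence between the LM of a link's context and the LM of its target.

    Both arguments are hypothetical inputs, e.g. anchor text and the title
    of the linked page.
    """
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_lm(source_text, vocab, alpha)
    q = unigram_lm(target_text, vocab, alpha)
    return kl_divergence(p, q)


if __name__ == "__main__":
    anchor = "cheap flights madrid"
    related = "book cheap flights to madrid today"
    unrelated = "online casino poker bonus"
    print(link_divergence(anchor, related))
    print(link_divergence(anchor, unrelated))
```

In this sketch a lower divergence suggests that the two texts are more coherent with each other, and a higher one suggests a mismatch. Such values would be inputs to a classifier alongside other features rather than a spam decision on their own.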
