Web Spam Detection by Learning from Small Labeled Samples

Web spamming attempts to deceive search engines into ranking some pages higher than they deserve. Many methods have been proposed to combat web spamming and to detect spam pages. One basic approach is classification, i.e., learning a classification model from previously labeled training data and using this model to classify web pages as spam or nonspam. A drawback of this approach is that manually labeling a large number of web pages to generate the training data can be biased, inaccurate, labor-intensive, and time-consuming. In this paper, we propose a new method that addresses this drawback by using semi-supervised learning to label the training data automatically. To do this, we incorporate the Expectation-Maximization (EM) algorithm, an efficient and important semi-supervised learning algorithm. Experiments carried out on real web spam data show that the new method performs very well in practice.

General Terms: Information Retrieval, Search Engine, Machine Learning.
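The general idea of using EM to label pages from a small labeled sample can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the abstract does not name the base classifier or the features, so the Gaussian naive Bayes model, the two numeric features per page, the toy data, and all function names (`m_step`, `posterior_spam`, `em_label`) are assumptions made for the example.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def m_step(X, R, min_var=1e-3):
    """Fit a prior and per-feature Gaussians for each class.

    R[i] is the (soft) probability that page i is spam.
    Returns [(prior, means, variances)] for class 0 (nonspam) and 1 (spam).
    """
    n, d = len(X), len(X[0])
    params = []
    for c in (0, 1):
        w = [r if c == 1 else 1.0 - r for r in R]
        total = sum(w) + 1e-9  # avoids division by zero and log(0)
        means = [sum(wi * x[j] for wi, x in zip(w, X)) / total for j in range(d)]
        variances = [
            max(sum(wi * (x[j] - means[j]) ** 2 for wi, x in zip(w, X)) / total, min_var)
            for j in range(d)
        ]
        params.append((total / n, means, variances))
    return params

def posterior_spam(x, params):
    """P(spam | x) under the naive Bayes model, computed in log space."""
    logs = []
    for prior, means, variances in params:
        lp = math.log(prior)
        lp += sum(gaussian_logpdf(xj, m, v) for xj, m, v in zip(x, means, variances))
        logs.append(lp)
    top = max(logs)
    e0, e1 = math.exp(logs[0] - top), math.exp(logs[1] - top)
    return e1 / (e0 + e1)

def em_label(X_lab, y_lab, X_unl, n_iter=20):
    """Semi-supervised EM: labeled pages keep their labels, unlabeled pages get soft labels."""
    hard = [float(y) for y in y_lab]
    params = m_step(X_lab, hard)  # initialise from the small labeled sample only
    for _ in range(n_iter):
        soft = [posterior_spam(x, params) for x in X_unl]  # E-step
        params = m_step(X_lab + X_unl, hard + soft)        # M-step
    return [posterior_spam(x, params) for x in X_unl], params

# Toy data with two hypothetical features per page (1 = spam, 0 = nonspam).
X_lab = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y_lab = [1, 1, 0, 0]
X_unl = [[0.85, 0.85], [0.15, 0.15], [0.9, 0.7], [0.1, 0.3]]

probs, _ = em_label(X_lab, y_lab, X_unl)
auto_labels = [1 if p > 0.5 else 0 for p in probs]
```

In this sketch, the thresholded `auto_labels` play the role of the automatically generated training labels; a real system would choose the features, the base model, and the threshold based on the data at hand.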
