Spammer Classification Using Ensemble Methods over Content-Based Features

As the number of web documents grows at a large scale, it becomes very difficult to access useful information. Search engines play a major role in the retrieval of relevant information and knowledge. They manage large amounts of information using efficient page ranking algorithms. Still, web spammers try to manipulate search engine results through various web spamming techniques for their personal benefit. According to a March 2016 report from Internetlivestats, an Internet survey company, there are currently 3.4 billion Internet users in the world. From this survey it can be judged that search engines play a vital role in the retrieval of information. In this research, we investigate fifteen different machine learning classification algorithms over content-based features to classify spam and non-spam web pages. An ensemble is built from the three algorithms that perform best on the basis of various parameters. Ten-fold cross-validation is also used.
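The evaluation setup described above (an ensemble of three classifiers, scored with ten-fold cross-validation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the three selected algorithms, the combination rule, or the content-based features, so the sketch uses a toy `FeatureThreshold` classifier as a stand-in for each base learner, simple majority voting as the combination rule, and synthetic three-feature data from a hypothetical `make_toy_pages` helper.

```python
import random
from collections import Counter


class FeatureThreshold:
    """Toy stand-in base classifier (hypothetical): thresholds one feature."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        spam = [row[self.feature] for row, label in zip(X, y) if label == 1]
        ham = [row[self.feature] for row, label in zip(X, y) if label == 0]
        self.spam_mean = sum(spam) / len(spam)
        self.ham_mean = sum(ham) / len(ham)
        # Decision boundary: midpoint between the two class means.
        self.threshold = (self.spam_mean + self.ham_mean) / 2
        return self

    def predict(self, X):
        spam_is_high = self.spam_mean >= self.ham_mean
        return [int((row[self.feature] >= self.threshold) == spam_is_high)
                for row in X]


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_vote(predictions):
    """Combine one label list per model into one voted label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def cross_validate(X, y, make_models, k=10):
    """Mean accuracy of a majority-vote ensemble over k held-out folds."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        y_test = [y[i] for i in test_idx]
        # Fresh models per fold so no information leaks from the test fold.
        models = [m.fit(X_train, y_train) for m in make_models()]
        voted = majority_vote([m.predict(X_test) for m in models])
        correct = sum(p == t for p, t in zip(voted, y_test))
        scores.append(correct / len(y_test))
    return sum(scores) / len(scores)


def make_toy_pages(n_per_class=100, seed=1):
    """Synthetic data (assumption): 3 made-up content features per page."""
    rng = random.Random(seed)
    ham = [[rng.uniform(0.0, 0.4) for _ in range(3)] for _ in range(n_per_class)]
    spam = [[rng.uniform(0.6, 1.0) for _ in range(3)] for _ in range(n_per_class)]
    return ham + spam, [0] * n_per_class + [1] * n_per_class


X, y = make_toy_pages()
mean_accuracy = cross_validate(
    X, y, lambda: [FeatureThreshold(f) for f in range(3)]
)
print(f"10-fold mean accuracy of the voting ensemble: {mean_accuracy:.3f}")
```

With an odd number of binary classifiers, majority voting never ties, which is one reason three base learners is a convenient ensemble size. In practice the toy stand-ins would be replaced by the three best-performing classifiers chosen from the candidate set, and the synthetic rows by real content-based features.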
