Spam host classification using swarm intelligence

Web spam, or spamdexing, is a form of search engine optimization (SEO) abuse that degrades the efficiency of search engines. Such exploits use unethical methods to push a web page to the top of the search results. Degrading the quality of the results retrieved by search engines can lead users to mistrust the search engine provider. Moreover, spam websites can serve as a starting point for phishing or malware attacks. Over the last decade, web spam has become an important problem. This paper presents a spam host detection approach based on swarm intelligence. We test our model on two datasets (WEBSPAM-UK2011 and WEBSPAM-UK2007) and show that it achieves good accuracy. We also compare our approach with other popular classifiers (C4.5, SVM, and Logistic Regression) and empirically demonstrate that it can outperform them in some cases.
