Spam host classification using PSO-SVM

Search engines have become the de facto starting point for information acquisition on the Internet. Degrading the quality of the results a search engine retrieves can lead users to distrust the search engine provider, and spam websites can also serve as a vehicle for phishing. This paper presents a spam host detection approach that uses a support vector machine (SVM) for classification. We create a parallel version of standard Particle Swarm Optimization (PSO) to determine the free parameters of the SVM classifier and apply the proposed model to the WEBSPAM-UK2011 content-based web spam dataset. Our implementation of parallel PSO is built on a pool of threads, where each thread executes the tasks associated with one particle of the swarm. Experiments show that the proposed model achieves higher accuracy than a standard SVM and outperforms other classifiers (C4.5, Naive Bayes). Furthermore, the parallel version of standard PSO can efficiently select parameters for the SVM.
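The thread-pool arrangement described above can be illustrated with a minimal sketch. It is not the implementation from the paper: it assumes a synchronous global-best PSO, assumes the two free parameters are log2(C) and log2(gamma) of an RBF-kernel SVM, and replaces the real fitness (which would be something like cross-validated SVM accuracy) with a hypothetical `toy_fitness` function so that the example runs with the standard library alone. The names `parallel_pso` and `toy_fitness`, the search bounds, and all coefficient values are illustrative choices.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def parallel_pso(fitness, bounds, n_particles=12, n_iters=30,
                 inertia=0.7, c1=1.5, c2=1.5, n_threads=4, seed=0):
    """Maximize `fitness` over the box `bounds` with a synchronous global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest_pos = [p[:] for p in pos]
    pbest_val = [float("-inf")] * n_particles
    gbest_pos, gbest_val = pos[0][:], float("-inf")

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for _ in range(n_iters):
            # One task per particle: fitness evaluations run in the thread pool.
            scores = list(pool.map(fitness, pos))

            # Update personal and global bests once all evaluations are back.
            for i, score in enumerate(scores):
                if score > pbest_val[i]:
                    pbest_val[i], pbest_pos[i] = score, pos[i][:]
                if score > gbest_val:
                    gbest_val, gbest_pos = score, pos[i][:]

            # Standard velocity and position update, clamped to the search box.
            for i in range(n_particles):
                for d, (lo, hi) in enumerate(bounds):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][d] = (inertia * vel[i][d]
                                 + c1 * r1 * (pbest_pos[i][d] - pos[i][d])
                                 + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                    pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
    return gbest_pos, gbest_val


def toy_fitness(params):
    # Hypothetical stand-in for cross-validated SVM accuracy; it peaks at
    # log2(C) = 2, log2(gamma) = -3 purely for illustration.
    log_c, log_gamma = params
    return 1.0 - 0.01 * ((log_c - 2.0) ** 2 + (log_gamma + 3.0) ** 2)


# Assumed search ranges for log2(C) and log2(gamma).
search_bounds = [(-5.0, 15.0), (-15.0, 3.0)]
best_params, best_score = parallel_pso(toy_fitness, search_bounds)
```

In a real experiment, `toy_fitness` would be replaced by a function that trains an SVM with the candidate parameters and returns its validation accuracy. In CPython, threads speed this up only when that function releases the global interpreter lock (for example, inside native training code); otherwise a process pool may be a better fit.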
