An effective feature selection method for web spam detection

Abstract Web spam is an illegal and immoral way to raise the ranking of web pages by deceiving search engine algorithms. Different methods have therefore been proposed to detect it and improve the quality of search results. Since a web page can be described from two aspects, its content and its links, the number of extracted features is high. Selecting the features with high discriminative power can thus serve as a preprocessing step that reduces computational time and cost. In this study, a new backward elimination approach is proposed for feature selection. The method resembles sequential backward selection, but it measures the impact on a classifier's performance of eliminating a set of features rather than a single feature. It seeks the largest feature subset whose omission from the full feature set not only does not reduce the performance of the classifier but improves it. Experiments on the WEBSPAM-UK2007 dataset with a Naive Bayes classifier show that the proposed method selects fewer features than other methods and improves the classifier's IBA index by about 7%.
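The set-wise elimination idea can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how candidate sets are chosen, so the halving strategy below is an assumption, and `eliminate_groups` and `toy_score` are hypothetical names. In practice `score` would be a cross-validated metric (for example IBA) of a Naive Bayes classifier trained on the given features; here it is a toy stand-in so the example runs on its own.

```python
from typing import Callable, FrozenSet, List, Sequence

Score = Callable[[FrozenSet[str]], float]


def eliminate_groups(features: Sequence[str], score: Score) -> List[str]:
    """Greedy set-wise backward elimination (illustrative assumption).

    A candidate group is dropped when removing it does not lower the
    score; otherwise it is split in half and each half is tried.
    """
    kept = list(features)
    best = score(frozenset(kept))
    pending = [list(features)]  # start with the whole set as one group
    while pending:
        group = [f for f in pending.pop() if f in kept]
        if not group:
            continue
        trial = [f for f in kept if f not in group]
        if trial:  # never remove every remaining feature
            trial_score = score(frozenset(trial))
            if trial_score >= best:
                kept, best = trial, trial_score
                continue
        if len(group) > 1:  # removal hurt: retry with smaller groups
            mid = len(group) // 2
            pending.append(group[mid:])
            pending.append(group[:mid])
    return kept


def toy_score(subset: FrozenSet[str]) -> float:
    """Hypothetical scorer: 'a' and 'c' help, every other feature hurts."""
    useful = {"a", "c"}
    return len(subset & useful) - 0.1 * len(subset - useful)


selected = eliminate_groups(["a", "b", "c", "d", "e"], toy_score)
print(selected)
```

Trying large groups first means that a block of uninformative features can be discarded with a single evaluation of the classifier, whereas sequential backward selection would need one evaluation per feature.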
