Web Spam Detection Using Multiple Kernels in Twin Support Vector Machine

Search engines are the most important tools for acquiring data from the web. They crawl and index web pages, and users typically locate useful pages by querying them. One challenge in search engine administration is spam pages, which waste search engine resources. These pages try to appear on the first page of results by deceiving the search engine's ranking algorithms. There are many approaches to detecting web spam pages, such as measuring HTML code style similarity, analyzing the linguistic patterns of pages, and applying machine learning algorithms to page content features. One well-known algorithm used in the machine learning approach is the Support Vector Machine (SVM) classifier. Recently, new extensions have changed the basic structure of SVM to increase its robustness and classification accuracy. In this paper, we improve the accuracy of web spam detection by using two nonlinear kernels in the Twin SVM (TSVM), an improved extension of SVM. Using a separate kernel for each class of data increases the classifier's ability to separate the data. The proposed method was evaluated on two public web spam datasets, UK-2007 and UK-2006. The results show the effectiveness of the proposed kernelized version of TSVM in web spam page detection.
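
To make the idea of one kernel per class concrete, the following is a minimal sketch and not the method evaluated in this paper. It assumes the least-squares variant of the twin SVM (chosen here only because it has a closed-form solution), two Gaussian kernels with different widths, spam labeled +1 and non-spam labeled -1, and synthetic two-dimensional data in place of the UK-2006 and UK-2007 features. The names `rbf_kernel`, `TwoKernelLSTSVM`, `kernel_pos`, `kernel_neg`, and `ridge` are hypothetical.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix: K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


class TwoKernelLSTSVM:
    """Least-squares twin SVM sketch with a separate kernel per class surface.

    Assumption: this is an illustrative stand-in, not the paper's solver.
    """

    def __init__(self, kernel_pos, kernel_neg, c1=1.0, c2=1.0, ridge=1e-4):
        self.kernel_pos = kernel_pos  # kernel for the surface near class +1
        self.kernel_neg = kernel_neg  # kernel for the surface near class -1
        self.c1, self.c2, self.ridge = c1, c2, ridge

    @staticmethod
    def _augment(K):
        # Append a column of ones so the bias is solved with the weights.
        return np.hstack([K, np.ones((K.shape[0], 1))])

    def fit(self, X, y):
        A, B = X[y == 1], X[y == -1]
        self.C_ = np.vstack([A, B])  # all training points act as kernel centers
        reg = self.ridge * np.eye(self.C_.shape[0] + 1)  # keeps the systems solvable

        # Surface 1: close to class +1, pushed to about -1 on class -1.
        G = self._augment(self.kernel_pos(A, self.C_))
        H = self._augment(self.kernel_pos(B, self.C_))
        self.z_pos_ = -np.linalg.solve(
            H.T @ H + G.T @ G / self.c1 + reg, H.T @ np.ones(len(B))
        )

        # Surface 2: close to class -1, pushed to about +1 on class +1.
        G = self._augment(self.kernel_neg(A, self.C_))
        H = self._augment(self.kernel_neg(B, self.C_))
        self.z_neg_ = np.linalg.solve(
            G.T @ G + H.T @ H / self.c2 + reg, G.T @ np.ones(len(A))
        )
        return self

    def predict(self, X):
        # Assign each point to the class whose surface value is nearer zero.
        # Simplification: the two values are compared without normalization.
        f_pos = self._augment(self.kernel_pos(X, self.C_)) @ self.z_pos_
        f_neg = self._augment(self.kernel_neg(X, self.C_)) @ self.z_neg_
        return np.where(np.abs(f_pos) <= np.abs(f_neg), 1, -1)


# Synthetic stand-in data: two well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
X_spam = rng.normal(loc=2.0, scale=0.5, size=(20, 2))
X_normal = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))
X = np.vstack([X_spam, X_normal])
y = np.array([1] * 20 + [-1] * 20)

model = TwoKernelLSTSVM(
    kernel_pos=lambda P, Q: rbf_kernel(P, Q, gamma=0.5),
    kernel_neg=lambda P, Q: rbf_kernel(P, Q, gamma=2.0),
).fit(X, y)
train_accuracy = float(np.mean(model.predict(X) == y))
print(f"training accuracy: {train_accuracy:.2f}")
```

Because each surface is fitted with its own kernel, the kernel type and its parameters can be tuned independently for the spam and non-spam classes. A standard twin SVM would instead solve two quadratic programs with inequality constraints, and the distances to the two surfaces could additionally be normalized before they are compared.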
