A Machine Learning Based Web Spam Filtering Approach

Web spam pollutes search engine results and reduces the usefulness of search engines. Web spam can be classified according to the method used to raise a web page's ranking by subverting the algorithms that search engines use to rank results. The main types are content spam, link spam, and cloaking spam. There has been little or no work on automatically classifying web spam by type. This paper makes two contributions: (i) we propose a Dual-Margin Multi-Class Hypersphere Support Vector Machine (DMMH-SVM) classifier for automatically classifying web spam by type, and (ii) we introduce novel cloaking-based spam features that help our classifier achieve high precision and recall, thereby reducing the false positive rate. The effectiveness of the proposed model is justified analytically. Our experimental results demonstrate that DMMH-SVM, when used with the novel cloaking features, outperforms existing algorithms.
