Web spam detection using trust and distrust-based ant colony optimization learning

Purpose – This paper presents a machine learning approach to Web spam detection. Based on ant colony optimization (ACO), three algorithms are proposed that construct rule-based classifiers to distinguish non-spam hosts from spam hosts. The paper also proposes an adaptive learning technique to improve spam detection performance.

Design/methodology/approach – The Trust-ACO algorithm lets an ant start from a non-spam seed and then choose which paths to walk through in the host graph. The trails (i.e. trust paths) discovered by the ants are then interpreted and compiled into non-spam classification rules. Similarly, the Distrust-ACO algorithm generates spam classification rules. The third algorithm, Combine-ACO, accumulates the rules produced by the two former algorithms. Moreover, an adaptive learning technique is introduced to let ants walk with longer (or shorter) steps by rewarding them when they find desirable paths or penalizin...
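As a rough illustration of the trust walk described above, the following Python sketch lets a single ant start from a non-spam seed and follow links in a toy host graph, choosing each next host with probability proportional to the pheromone on the link. It is a minimal sketch, not the paper's implementation: the function names `trust_walk` and `reinforce`, the host names, the uniform initial pheromone, the evaporation rate, and the deposit amount are all assumptions made for illustration. Rule compilation and the adaptive step length are not shown.

```python
import random


def trust_walk(graph, pheromone, seed, max_steps, rng):
    """Walk from a non-spam seed host, choosing each next host with
    probability proportional to the pheromone on the outgoing link."""
    trail = [seed]
    current = seed
    for _ in range(max_steps):
        # Skip hosts already on the trail so the trail stays a simple path.
        candidates = [h for h in graph.get(current, []) if h not in trail]
        if not candidates:
            break
        weights = [pheromone.get((current, h), 1.0) for h in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        trail.append(current)
    return trail


def reinforce(pheromone, trail, amount=1.0, evaporation=0.1):
    """Evaporate pheromone on every link, then deposit `amount` on each
    link of the given trail (assumed update rule, for illustration)."""
    for edge in pheromone:
        pheromone[edge] *= 1.0 - evaporation
    for a, b in zip(trail, trail[1:]):
        pheromone[(a, b)] = pheromone.get((a, b), 1.0) + amount


# Hypothetical toy host graph: host -> hosts it links to.
graph = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example"],
    "b.example": [],
    "c.example": [],
}
# Assumed uniform initial pheromone on every link.
pheromone = {(u, v): 1.0 for u, nbrs in graph.items() for v in nbrs}

rng = random.Random(0)
trail = trust_walk(graph, pheromone, "seed.example", max_steps=3, rng=rng)
reinforce(pheromone, trail)
```

After `reinforce`, links on the discovered trail carry more pheromone than the others, so later ants starting from the same seed are more likely to follow the same trust path.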
