A systematic framework to discover pattern for web spam classification

Web spam is a major problem for search engine users on the World Wide Web. Spam websites use deceptive techniques to achieve high rankings. Although many researchers have proposed different approaches to web spam classification and detection, it remains an open issue in computer science. Analyzing and evaluating these websites can be an effective step toward discovering and categorizing their features. Several methods and algorithms exist for detecting such websites, such as decision tree algorithms. In this paper, we present a systematic framework based on the CHAID algorithm and a modified string matching algorithm (KMP) for extracting features from these websites and analyzing them. We evaluated our model and other methods on a dataset consisting of the Alexa Top 500 Global Sites and Bing search engine results for 500 queries.
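To illustrate the string matching step, the following is a minimal sketch of the standard Knuth-Morris-Pratt (KMP) algorithm. It is not the modified variant used in the framework, whose details are not given here. The function names and the idea of locating a keyword in page text are assumptions made only for illustration.

```python
def build_failure_table(pattern):
    """For each prefix of pattern, length of its longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(text, pattern):
    """Return the start indices of all (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return matches


# Hypothetical feature: how often a keyword appears in a page's text.
page_text = "cheap pills cheap offers cheap"
keyword_count = len(kmp_find_all(page_text, "cheap"))
```

Because the failure table lets the search resume without re-reading text characters, the scan runs in time linear in the combined length of the text and the pattern, which matters when many pages are scanned for many keywords.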
