Novel Features for Web Spam Detection

Recent research on web spam detection has produced promising results, including many new and efficient detection algorithms. While most of this work focuses on the algorithms themselves, our investigation shows that the features those algorithms use are in fact very important: different features can lead to very different results. This paper investigates three types of web spam (content-based, link-based, and cloaking) and introduces new features for identifying each type. Our experimental results show that the new features significantly improve detection performance.
