TubeSpam: Comment Spam Filtering on YouTube

The profitability promoted by Google on its brand-new video distribution platform, YouTube, has attracted a growing number of users. However, this success has also attracted malicious users, who aim to self-promote their videos or disseminate viruses and malware. Since YouTube offers limited tools for comment moderation, the volume of spam has grown sharply, leading owners of famous channels to disable the comments section on their videos. Automatic comment spam filtering on YouTube is a challenge even for established classification methods, since the messages are very short and often rife with slang, symbols, and abbreviations. In this work, we evaluate several top-performing classification techniques for this purpose. The statistical analysis of the results indicates that, at a 99.9% confidence level, decision trees, logistic regression, Bernoulli Naive Bayes, random forests, and linear and Gaussian SVMs are statistically equivalent. Based on this, we also offer TubeSpam, an accurate online system to filter comments posted on YouTube.
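
To make the classification task concrete, the following is a minimal sketch of one of the evaluated model families, Bernoulli Naive Bayes, applied to short comments represented as binary word-presence features. It is not the authors' implementation: the tokenizer, the `BernoulliNB` class, the smoothing value, and the six training comments are all hypothetical and chosen only for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase the comment and return the set of word tokens it contains."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class BernoulliNB:
    """Bernoulli Naive Bayes over binary word-presence features (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, texts, labels):
        docs = [tokenize(t) for t in texts]
        self.vocab = sorted(set().union(*docs))
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.p_word = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            # Smoothed probability that each vocabulary word appears in a
            # comment of class c.
            self.p_word[c] = {
                w: (sum(w in d for d in class_docs) + self.alpha)
                / (len(class_docs) + 2 * self.alpha)
                for w in self.vocab
            }
        return self

    def predict(self, text):
        present = tokenize(text)
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            # Absent words also contribute evidence in the Bernoulli model.
            for w in self.vocab:
                p = self.p_word[c][w]
                score += math.log(p if w in present else 1.0 - p)
            scores[c] = score
        return max(scores, key=scores.get)


# Invented toy comments, not taken from any real dataset.
train = [
    ("check out my channel and subscribe", "spam"),
    ("free gift cards click this link", "spam"),
    ("subscribe to my channel for free views", "spam"),
    ("this song is amazing", "ham"),
    ("love the guitar solo in this video", "ham"),
    ("great video thanks for sharing", "ham"),
]
texts, labels = zip(*train)
model = BernoulliNB().fit(texts, labels)

print(model.predict("please subscribe to my channel"))
print(model.predict("amazing song love it"))
```

Words unseen during training (such as "please" above) are simply ignored, which is one reason very short, slang-heavy comments are hard for this kind of model: few of their tokens may carry any learned evidence.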
