Towards filtering undesired short text messages using an online learning approach with semantic indexing

A new classifier is presented to detect undesired short text comments. The proposed approach is light, fast, multinomial, and offers incremental learning. The impact of applying text normalization and semantic indexing is studied. The results indicate that the proposed techniques outperformed most of the compared approaches. Text normalization and semantic indexing enhanced the classifiers' performance.

The popularity and reach of short text messages in electronic communication have led spammers to use them to propagate undesired content. Such content often consists of misleading information, advertisements, viruses, and malware that can be harmful and annoying to users. The dynamic nature of spam messages demands knowledge-based systems with online learning; therefore, most traditional text categorization techniques cannot be used. In this study, we introduce MDLText, a text classifier based on the minimum description length principle, to the context of filtering undesired short text messages. The proposed approach supports incremental learning, so its predictive model is scalable and can adapt to continuously evolving spamming techniques. It is also fast, with computational cost increasing linearly with the number of samples and features, which is very desirable for expert systems applied to real-time electronic communication. Besides being dynamic, these messages are short and usually poorly written, rife with slang, symbols, and abbreviations that hinder text representation, learning, and filtering. In this scenario, we also investigated the benefits of using text normalization and semantic indexing techniques. We showed that these techniques can improve the quality of the text content and, consequently, enhance the performance of expert systems for spam detection. Based on these findings, we propose a new hybrid ensemble approach that combines the predictions obtained by the classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques. It has the advantage of being independent of the classification method, and the results indicate that it is efficient at filtering undesired short text messages.
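The two ideas in the abstract, an incrementally trained classifier that chooses the class giving the shortest description of a message, and an ensemble that votes over the original, normalized, and semantically indexed variants of that message, can be illustrated with a minimal sketch. This is not the MDLText algorithm itself: the code-length formula below is a plain Laplace-smoothed token code chosen for illustration, and the names TinyMDLClassifier, VariantEnsemble, SLANG, and CONCEPTS, as well as the tiny lookup tables, are hypothetical stand-ins for real normalization dictionaries and semantic resources.

```python
import math
from collections import Counter, defaultdict


class TinyMDLClassifier:
    """Toy incremental classifier (an assumption, not MDLText): predicts the
    class under which a message has the shortest code length in bits."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # class -> token -> count
        self.total_tokens = Counter()             # class -> number of tokens seen
        self.vocabulary = set()

    def learn(self, tokens, label):
        # Online update: only counters change, so the cost of one update
        # is linear in the length of the message.
        self.token_counts[label].update(tokens)
        self.total_tokens[label] += len(tokens)
        self.vocabulary.update(tokens)

    def description_length(self, tokens, label):
        # Bits needed to encode the tokens using Laplace-smoothed
        # class-conditional token frequencies.
        denom = self.total_tokens[label] + len(self.vocabulary) + 1
        return sum(
            -math.log2((self.token_counts[label][t] + 1) / denom) for t in tokens
        )

    def predict(self, tokens):
        return min(
            self.token_counts,
            key=lambda label: self.description_length(tokens, label),
        )


# Hypothetical lookup tables standing in for real lexical resources.
SLANG = {"u": "you", "txt": "text", "2": "to", "plz": "please"}
CONCEPTS = {"win": "prize", "cash": "prize", "free": "prize",
            "lunch": "meal", "dinner": "meal"}


def original(text):
    return text.lower().split()


def normalized(text):
    # Text normalization: map slang and abbreviations to standard words.
    return [SLANG.get(t, t) for t in original(text)]


def semantic(text):
    # Semantic indexing (greatly simplified): map words to a shared concept.
    return [CONCEPTS.get(t, t) for t in normalized(text)]


class VariantEnsemble:
    """One classifier per text variant; the final label is a majority vote."""

    def __init__(self, make_classifier=TinyMDLClassifier):
        views = (original, normalized, semantic)
        self.members = [(view, make_classifier()) for view in views]

    def learn(self, text, label):
        for view, clf in self.members:
            clf.learn(view(text), label)

    def predict(self, text):
        votes = Counter(clf.predict(view(text)) for view, clf in self.members)
        return votes.most_common(1)[0][0]


if __name__ == "__main__":
    ensemble = VariantEnsemble()
    stream = [
        ("win free cash now txt 2 claim", "spam"),
        ("free prize plz txt now", "spam"),
        ("are u coming 2 lunch", "ham"),
        ("see u at dinner tonight", "ham"),
    ]
    for message, label in stream:  # messages arrive one at a time
        ensemble.learn(message, label)
    print(ensemble.predict("txt 2 win cash"))
```

Because the ensemble only calls learn and predict on its members, any classifier exposing those two methods could be passed as make_classifier, which mirrors the abstract's claim that the ensemble is independent of the classification method.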
