Detecting Spam Tweets using Character N-gram Features

Twitter's popularity has made it an important, near-instantaneous source of news and trending events around the world. This has attracted spammers, who post malicious content in tweets and on their profile pages. Spammers use varied and evolving techniques to evade traditional security mechanisms, which creates the need for robust solutions that adapt to these techniques. In this paper, we propose a low-level character n-gram feature that avoids tokenizers and any other language-dependent tools. Using a publicly available dataset, we evaluate the performance of multiple machine learning classifiers with different representations of the proposed feature. Our experiments show that our approach improves on approaches that use word n-grams from tweet tokens. We also show that our technique can detect spam tweets with low latency, which is crucial in a real-time environment like Twitter.
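
As a rough illustration of what a tokenizer-free character n-gram feature can look like, the sketch below counts every contiguous character substring of a tweet within a length range. It is not the implementation evaluated in this paper: the function names, the n-gram range, and the two representations shown (binary presence and relative frequency) are assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count all contiguous character substrings of length n_min..n_max.

    The raw string is used as-is: no tokenizer, stemmer, or other
    language-dependent tool is involved.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # A string of length L contains L - n + 1 substrings of length n.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def binary_features(counts):
    """Assumed representation 1: 1 for every n-gram that occurs."""
    return {gram: 1 for gram in counts}


def tf_features(counts):
    """Assumed representation 2: count divided by total n-gram count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


# Hypothetical tweet text, used only to exercise the functions.
tweet = "Fr33 f0llowers!! http://t.co/x"
bigrams_and_trigrams = char_ngrams(tweet, n_min=2, n_max=3)
```

Because the features are taken directly from the character sequence, obfuscated spellings such as "Fr33" still share substrings with other spam text, which is one commonly cited motivation for character-level features over word tokens. The resulting dictionaries can be fed to any classifier that accepts sparse feature mappings.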
