Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter

Social spam produces a great amount of noise on social media services such as Twitter, which reduces the signal-to-noise ratio that both end users and data mining applications observe. Existing techniques on social spam detection have focused primarily on the identification of spam accounts by using extensive historical and network-based data. In this paper we focus on the detection of spam tweets, which optimises the amount of data that needs to be gathered by relying only on tweet-inherent features. This enables the application of the spam detection system to a large set of tweets in a timely fashion, potentially applicable in a real-time or near real-time setting. Using two large hand-labelled datasets of tweets containing spam, we study the suitability of five classification algorithms and four different feature sets to the social spam detection task. Our results show that, by using the limited set of features readily available in a tweet, we can achieve encouraging results which are competitive when compared against existing spammer detection systems that make use of additional, costly user features. Our study is the first that attempts at generalising conclusions on the optimal classifiers and sets of features for social spam detection over different datasets.

[1]  Arkaitz Zubiaga,et al.  Towards Detecting Rumours in Social Media , 2015, AAAI Workshop: AI for Cities.

[2]  Juan Martínez-Romo,et al.  Detecting malicious tweets in trending topics using a statistical analysis of language , 2013, Expert Syst. Appl..

[3]  M. Chuah,et al.  Spam Detection on Twitter Using Traditional Classifiers , 2011, ATC.

[4]  Andreas Heinz,et al.  Twitter psychosis: a rare variation or a distinct syndrome? , 2014, The Journal of nervous and mental disease.

[5]  Bo Pang,et al.  The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter , 2014, ACL.

[6]  Virgílio A. F. Almeida,et al.  Detecting Spammers on Twitter , 2010 .

[7]  Hsinchun Chen,et al.  AI and Opinion Mining , 2010, IEEE Intelligent Systems.

[8]  Chao Yang,et al.  Empirical Evaluation and New Design for Fighting Evolving Twitter Spammers , 2011, IEEE Transactions on Information Forensics and Security.

[9]  Saif Mohammad,et al.  NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets , 2013, *SEMEVAL.

[10]  Wei Hu,et al.  Twitter spammer detection using data stream clustering , 2014, Inf. Sci..

[11]  Kyumin Lee,et al.  Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter , 2011, ICWSM.

[12]  Lluís Màrquez i Villodre,et al.  Boosting Trees for Anti-Spam Email Filtering , 2001, ArXiv.

[13]  Enrico Blanzieri,et al.  A survey of learning-based techniques of email spam filtering , 2008, Artificial Intelligence Review.

[14]  Filippo Menczer,et al.  The rise of social bots , 2014, Commun. ACM.

[15]  Igor Santos,et al.  Twitter Content-Based Spam Filtering , 2013, SOCO-CISIS-ICEUTE.

[16]  Finn Årup Nielsen,et al.  A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs , 2011, #MSM.

[17]  Alex Hai Wang,et al.  Don't follow me: Spam detection in Twitter , 2010, 2010 International Conference on Security and Cryptography (SECRYPT).

[18]  Rossano Schifanella,et al.  People Are Strange When You're a Stranger: Impact and Influence of Bots on Social Networks , 2012, ICWSM.

[19]  Bing Liu Sentiment Analysis , 2020 .

[20]  Brendan T. O'Connor,et al.  Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments , 2010, ACL.

[21]  Janyce Wiebe,et al.  Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis , 2005, HLT.