A Topic-Based Hidden Markov Model for Real-Time Spam Tweets Filtering

Online social networks (OSNs) have become an important source of information for a wide range of applications and research areas, such as search engines and summarization systems. However, the high usability and accessibility of OSNs have exposed many information quality (IQ) problems, which in turn degrade the performance of OSN-dependent applications. Social spammers are a particular kind of ill-intentioned user who degrade the quality of OSN information by misusing the services that OSNs provide. They post large volumes of tweets to lure legitimate users to malicious or commercial sites containing malware downloads, phishing, and drug sales. Because Twitter is not immune to the social spam problem, researchers have designed various detection methods that inspect individual tweets or accounts for spam content. Although account-based spam detection methods achieve high detection rates, they are not suitable for real-time tweet filtering because they need information from Twitter's servers. At the tweet level, many lightweight features have been proposed for real-time filtering; however, existing classification models classify each tweet in isolation, without considering the state of previously handled tweets associated with the same topic. These models also require periodic retraining on ground-truth data to stay up to date. Hence, in this paper, we formalize a Hidden Markov Model (HMM) as a time-dependent model for real-time topical spam tweet filtering. More precisely, our method leverages only the meta-data available in the tweet object to detect spam tweets existing in a stream of tweets related to a topic (e.g., #Trump), while considering the state of previously handled tweets associated with the same topic.
Compared to classical time-independent classification methods such as Random Forest, the experimental evaluation demonstrates the efficiency of our method in increasing the quality of topics in terms of precision, recall, and F-measure.
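To illustrate what "time-dependent" means here, the following is a minimal sketch of online HMM forward filtering over a topic stream. It is not the model from the paper: the two hidden states, the single binary observation (whether a tweet contains a URL), and every probability value are hypothetical choices made only for illustration, as are the names `forward_step` and `filter_stream`.

```python
from typing import List, Optional, Sequence

STATES = ("ham", "spam")

# Hypothetical parameters for illustration only; none come from the paper.
START = [0.8, 0.2]                 # P(state of the first tweet)
TRANS = [[0.9, 0.1],               # P(next state | current state = ham)
         [0.3, 0.7]]               # P(next state | current state = spam)
EMIT = [[0.7, 0.3],                # P(obs | ham);  obs 0 = no URL, 1 = has URL
        [0.2, 0.8]]                # P(obs | spam)


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def forward_step(belief: Sequence[float], obs: int) -> List[float]:
    """Update P(state | observations so far) with one new observation."""
    n = len(STATES)
    # Predict: push the previous belief through the transition matrix.
    predicted = [sum(belief[i] * TRANS[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight each state by how well it explains the observation.
    return normalize([predicted[j] * EMIT[j][obs] for j in range(n)])


def filter_stream(observations: Sequence[int]) -> List[str]:
    """Label each tweet of a topic stream using the filtered state belief."""
    labels: List[str] = []
    belief: Optional[List[float]] = None
    for obs in observations:
        if belief is None:
            # First tweet: combine the start distribution with the emission.
            belief = normalize([START[j] * EMIT[j][obs] for j in range(len(STATES))])
        else:
            belief = forward_step(belief, obs)
        labels.append(STATES[0] if belief[0] >= belief[1] else STATES[1])
    return labels


if __name__ == "__main__":
    # Three consecutive tweets on one topic, each containing a URL.
    print(filter_stream([1, 1, 1]))
```

Under these assumed parameters, identical observations can receive different labels depending on what preceded them in the stream, which is the behavior a per-tweet classifier such as Random Forest cannot express. Each update uses only the previous belief and the current tweet, so the per-tweet cost is constant.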
