Real-time classification of malicious URLs on Twitter using machine activity data

Massive online social networks with hundreds of millions of active users are increasingly used by cybercriminals to spread malicious software (malware) that exploits vulnerabilities on users' machines for personal gain. Twitter is particularly susceptible to such activity because, with its 140-character limit, people commonly include URLs in their tweets to link to more detailed information, evidence, news reports and so on. URLs are often shortened, so the endpoint is not obvious before a person clicks the link. Cybercriminals can exploit this to propagate malicious URLs on Twitter, for which the endpoint is a malicious server that performs unwanted actions on the person's machine. This is known as a drive-by-download. In this paper we develop a machine classification system to distinguish between malicious and benign URLs within seconds of the URL being clicked (i.e. in "real time"). We train the classifier using machine activity logs created while interacting with URLs extracted from Twitter data collected during a large global event - the Super Bowl - and test it using data from another large sporting event - the Cricket World Cup. The results show that machine activity logs yield a precision of up to 0.975 on training data from the first event and 0.747 on test data from the second event. Furthermore, we examine the properties of the learned model to explain the relationship between machine activity and malicious software behaviour, and build a learning curve for the classifier to illustrate that very small samples of training data can be used with only a small detriment to performance.
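
The cross-event evaluation described above (fit on one event, measure precision on another) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-value feature vectors stand in for machine activity measurements, the data is synthetic, and the nearest-centroid rule is a stand-in classifier, not the model used in the paper.

```python
from statistics import mean

def centroid(rows):
    # Column-wise mean of a list of feature vectors.
    return [mean(col) for col in zip(*rows)]

def fit(samples):
    # samples: list of (features, label); label 1 = malicious, 0 = benign.
    return {label: centroid([x for x, y in samples if y == label])
            for label in (0, 1)}

def predict(model, x):
    # Assign the label whose centroid is closest (squared Euclidean distance).
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist(model[label]))

def precision(model, samples):
    # Precision = true positives / all URLs flagged as malicious.
    preds = [(predict(model, x), y) for x, y in samples]
    tp = sum(1 for p, y in preds if p == 1 and y == 1)
    fp = sum(1 for p, y in preds if p == 1 and y == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Synthetic, hypothetical activity features for two separate events.
train_event = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]
test_event = [([0.7, 0.9], 1), ([0.6, 0.6], 0), ([0.1, 0.1], 0), ([0.9, 0.7], 1)]

model = fit(train_event)
train_precision = precision(model, train_event)
test_precision = precision(model, test_event)
print(train_precision, test_precision)
```

On this toy data the precision on the unseen event is lower than on the training event, because one benign sample falls nearer the malicious centroid. This mirrors only the direction of the gap reported in the abstract, not its size.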
