Using supervised machine learning algorithms to detect suspicious URLs in online social networks

The increasing volume of malicious content in social networks calls for automated methods to detect and remove it. This paper describes a supervised machine learning classification model built to detect the distribution of malicious content in online social networks (OSNs). Multisource features are used to detect social network posts that contain malicious Uniform Resource Locators (URLs). Such URLs can direct users to websites that host malicious content or carry out drive-by download attacks, phishing, spam, and scams. In the data collection stage, the Twitter streaming application programming interface (API) was used to gather posts, and VirusTotal was used to label the dataset. A random forest classification model was trained on a combination of features derived from a range of sources. Without any tuning or feature selection, the random forest model achieved a recall of 0.89. After applying parameter tuning and feature selection methods, we improved the classifier's recall to 0.92.
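The comparison described above, an untuned random forest against one with feature selection and parameter tuning, both scored by recall, can be sketched as follows. This is a minimal illustration and not the pipeline used in the paper: it assumes scikit-learn, replaces the labelled Twitter/VirusTotal dataset with synthetic data, and the parameter grid, the model-based selection step, and all variable names are hypothetical choices made for the example. The recall values it prints say nothing about the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-in for the labelled posts (1 = malicious URL).
X, y = make_classification(
    n_samples=600,
    n_features=20,
    n_informative=6,
    weights=[0.7, 0.3],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: random forest with default parameters and all features.
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline_recall = recall_score(y_test, baseline.predict(X_test))

# Tuned variant: importance-based feature selection followed by a forest,
# with a small (hypothetical) grid searched by cross-validated recall.
pipeline = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0))),
    ("forest", RandomForestClassifier(random_state=0)),
])
param_grid = {
    "select__threshold": ["mean", "median"],
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 10],
}
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=3)
search.fit(X_train, y_train)
tuned_recall = recall_score(y_test, search.predict(X_test))

print(f"baseline recall: {baseline_recall:.2f}")
print(f"tuned recall:    {tuned_recall:.2f}")
print("best parameters:", search.best_params_)
```

Placing the selection step inside the pipeline means features are chosen using only the training folds during cross-validation, so the held-out test set is not consulted until the final recall is computed.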
