A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors

Abstract Websites attract millions of visitors because of the convenient services they offer, which makes them attractive targets for cyber attackers. Most of these websites use JavaScript (JS) to create dynamic content. Exploiting vulnerabilities in servers, plugins, and other third-party systems enables attackers to insert malicious code into websites. These exploits use methods such as drive-by downloads, pop-up ads, and phishing attacks on news, pornography, piracy, torrent, or free software websites, among others. Many recent cyber-attacks exploit JS vulnerabilities, in some cases employing obfuscation to hide their maliciousness and evade detection. It is therefore essential to develop an accurate detection system for malicious JS to protect users from such attacks. To address this issue, this study adopts the Abstract Syntax Tree (AST) to represent code structure and Doc2Vec, a machine learning approach, to conduct feature learning. Doc2Vec is a neural network model that can learn context information from texts of variable length. This makes it a well-suited feature learning method for JS code, whose text content ranges from single-line sentences to paragraphs and full-length documents. In addition, features learned with Doc2Vec are low-dimensional, which enables faster detection. A classifier model judges the maliciousness of a JS code sample using the learned features. The performance of this approach is evaluated using the D3M dataset (Drive-by-Download Data by Marionette) for malicious JS code and the JSUNPACK plus Alexa top 100 websites datasets for benign JS code. We then compare the performance of Doc2Vec on plain JS code (Plain-JS) and on the AST form of JS code (AST-JS) with that of other feature learning methods. Our experimental results show that the proposed combination of AST features and Doc2Vec feature learning provides better accuracy and faster classification in malicious JS code detection than conventional approaches, and can flag malicious JS code previously identified as hard to detect.
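To make the AST-JS representation more concrete, the following is a minimal sketch of how a parsed JS program might be flattened into a sequence of node-type tokens before feature learning. It assumes the AST has already been produced by an ESTree-compatible parser and is available as nested Python dictionaries. The function names `iter_node_types` and `ast_to_tokens`, the depth-first pre-order traversal, and the choice of node types as tokens are illustrative assumptions, not details confirmed by the abstract.

```python
from typing import Any, Iterator, List


def iter_node_types(node: Any) -> Iterator[str]:
    """Yield the 'type' of every AST node in depth-first, pre-order.

    Assumes an ESTree-style layout: each node is a dict with a string
    'type' field, and children are nested dicts or lists of dicts.
    """
    if isinstance(node, dict):
        node_type = node.get("type")
        if isinstance(node_type, str):
            yield node_type
        # Recurse into every field; plain strings and numbers yield nothing.
        for value in node.values():
            yield from iter_node_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_node_types(item)


def ast_to_tokens(ast: dict) -> List[str]:
    """Flatten an AST into a token list (hypothetical helper)."""
    return list(iter_node_types(ast))


# Hand-written ESTree-style AST for the one-line script: eval(x);
example_ast = {
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "CallExpression",
                "callee": {"type": "Identifier", "name": "eval"},
                "arguments": [{"type": "Identifier", "name": "x"}],
            },
        }
    ],
}

tokens = ast_to_tokens(example_ast)
print(tokens)
```

A token list of this kind could then be treated as one "document" by a Doc2Vec implementation (for example, wrapped in gensim's `TaggedDocument`), with the resulting fixed-length vector passed to a classifier. Because identifier names and string literals are dropped, renaming variables alone would not change the token sequence under this assumed encoding.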
