A Machine Learning Approach to Malicious JavaScript Detection using Fixed Length Vector Representation

JavaScript (JS) is widely used to add functionality to web applications and improve their usability. Despite these advantages, many recent cyberattacks, such as drive-by-download attacks, exploit vulnerabilities in JS code. Malicious JS code is generally hard to detect because it covertly exploits vulnerabilities in browsers and plugin software and attacks visitors to a web site without their knowledge. To protect users from such threats, an accurate detection system for malicious JS is needed. Conventional approaches often rely on signature-based and heuristic-based methods, which cope poorly with zero-day attacks and therefore produce many false negatives and/or false positives. To address this problem, this paper adopts Doc2Vec, a neural network model that can learn context information from text, as a machine-learning approach to feature learning. The extracted features are passed to a classifier (e.g., an SVM or a neural network), which judges whether a JS code is malicious. In the performance evaluation, we use the D3M Dataset (Drive-by-Download Data by Marionette) for malicious JS code and JSUPACK for benign code, for both training and testing. We then compare the performance with that of other feature learning methods. Our experimental results show that the proposed Doc2Vec features provide better accuracy and faster classification in malicious JS code detection than conventional approaches.
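Doc2Vec operates on documents represented as sequences of tokens, so a JS code must first be split into tokens before features can be learned. The following is a minimal sketch of such a preprocessing step, not the procedure used in the paper: the token pattern, the function name `tokenize_js`, and the sample snippets are all hypothetical and chosen only for illustration. Training the Doc2Vec model and the classifier is not shown.

```python
import re

# Hypothetical token pattern (an assumption, not taken from the paper):
# identifiers, numeric literals, quoted strings, and single punctuation marks.
TOKEN_RE = re.compile(
    r"""
    [A-Za-z_$][A-Za-z0-9_$]*      # identifiers and keywords
    | \d+(?:\.\d+)?               # numeric literals
    | "(?:\\.|[^"\\])*"           # double-quoted strings
    | '(?:\\.|[^'\\])*'           # single-quoted strings
    | [^\sA-Za-z0-9_$]            # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize_js(source):
    """Split JS source text into a flat list of tokens; whitespace is dropped."""
    return TOKEN_RE.findall(source)


# Illustrative samples only; these are not drawn from the D3M or JSUPACK data.
samples = {
    "benign_0": 'document.title = "hello";',
    "malicious_0": 'eval(unescape("%75%6e"));',
}

# Each entry pairs a document tag with its token sequence, which is the
# general shape of input that a Doc2Vec-style model is trained on.
documents = [(tag, tokenize_js(src)) for tag, src in samples.items()]

for tag, tokens in documents:
    print(tag, tokens)
```

Each token list could then be mapped to a fixed-length vector by a trained Doc2Vec model, and the vectors passed to a classifier such as an SVM.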
