A feature-hybrid malware variants detection using CNN based opcode embedding and BPNN based API embedding

Abstract Detecting malware variants is a critical problem because of the potential damage they cause and the fast pace at which new variations appear. According to surveys from McAfee and Symantec, about 69 new instances of malware are detected every minute, and more than 50% of them are variants of existing ones. Such a large volume of diversified malware variants has forced researchers to investigate new methods based on common behavior patterns using machine learning. However, such methods use only a single type of feature, such as opcodes or system calls, which has several drawbacks. Firstly, the methods lose part of the useful information, since different types of features show different characteristics of malware. This severely limits detection precision and recall. Secondly, the accuracy and the speed (as a trade-off) of such methods fail to meet users' expectations. Thirdly, the precise classification of malware families, which is also important in malware analysis, is still a hard problem. In this work, we propose a feature-hybrid malware variants detection approach which integrates multiple types of features to address these challenges. We first represent opcodes by a bi-gram model and represent API calls by a frequency vector. We then use principal component analysis to optimize the representations and improve the convergence speed. Next, we adopt a convolutional neural network and a back-propagation neural network for opcode-based feature embedding and API-based feature embedding, respectively. Finally, we combine the embedded features to train a detection model using softmax. Theoretical analysis and real-life experimental results show the efficiency of our approach, which achieves more than 95% malware detection accuracy and almost 90% classification accuracy of malware families. The detection time of our approach is less than 0.1 s.
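The feature pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration in NumPy, not the authors' implementation: the function names, the toy vocabularies, and the normalization of bi-gram counts are assumptions, and the CNN and BPNN embedding stages are omitted entirely. A single untrained linear layer (hypothetical weights `W`) stands in for them so that the final softmax step over the fused features can be shown.

```python
from collections import Counter
from itertools import product

import numpy as np


def opcode_bigram_vector(opcodes, vocab):
    """Normalized counts of adjacent opcode pairs over vocab x vocab (assumed encoding)."""
    index = {pair: i for i, pair in enumerate(product(vocab, repeat=2))}
    counts = Counter(zip(opcodes, opcodes[1:]))
    vec = np.zeros(len(index))
    for pair, n in counts.items():
        if pair in index:  # pairs with out-of-vocabulary opcodes are ignored
            vec[index[pair]] = n
    total = vec.sum()
    return vec / total if total > 0 else vec


def api_frequency_vector(api_calls, api_vocab):
    """Raw call counts, one entry per API name in api_vocab."""
    counts = Counter(api_calls)
    return np.array([counts[name] for name in api_vocab], dtype=float)


def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T


def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Toy data: three samples, hypothetical opcode and API vocabularies.
op_vocab = ["mov", "push", "call"]
api_vocab = ["CreateFile", "WriteFile", "RegSetValue"]
samples = [
    (["mov", "push", "call", "mov"], ["CreateFile", "WriteFile"]),
    (["push", "push", "call"], ["RegSetValue", "RegSetValue"]),
    (["mov", "mov", "call", "push"], ["CreateFile"]),
]

X_op = np.array([opcode_bigram_vector(ops, op_vocab) for ops, _ in samples])
X_api = np.array([api_frequency_vector(apis, api_vocab) for _, apis in samples])

# Reduce each feature type separately, then fuse by concatenation.
fused = np.hstack([pca_reduce(X_op, 2), pca_reduce(X_api, 2)])

# Stand-in for the trained networks: random, untrained weights for 2 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(fused.shape[1], 2))
probs = softmax(fused @ W)
```

Each row of `probs` is a probability distribution over the two placeholder classes. In the approach described above, the reduced opcode features would pass through a CNN and the reduced API features through a BPNN before fusion, and the classifier would be trained rather than randomly initialized.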
