The Impact of Lightweight Disassembler on Malware Detection: An Empirical Study

Malicious software poses serious threats to our lives, and detecting malware is becoming increasingly important. An effective approach is to train a classifier on known software and malware samples and then use it to recognize malware among new software. A popular recent trend is to use OpCode, extracted from executable modules, as the representation of software entities that drives machine learning. However, we found that the effectiveness of such a framework suffers badly from insufficient samples, which is caused by the low success rate of disassembly due to the intrinsic complexity of the problem. In this paper, we propose to raise the success rate of disassembly by tolerating inaccurate disassembly, so that more samples are successfully disassembled and OpCode-driven malware detection improves. We built a lightweight disassembler, D-light, based on the linear sweep disassembly method to avoid known issues with the recursive descent approach of IDA Pro. We carried out experiments to evaluate the performance, effectiveness, and other design factors of adopting D-light and IDA Pro as disassemblers for malware detection. The empirical study shows that D-light is both more efficient and more effective than IDA Pro in supporting malware detection.
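
To make the idea concrete, the following is a minimal sketch of a linear sweep that tolerates undecodable bytes and feeds OpCode n-gram counts to a downstream classifier. It is not D-light's implementation: the table `TOY_TABLE`, the functions `linear_sweep` and `opcode_ngrams`, and the `db` placeholder for unknown bytes are all assumptions made for illustration, and the table covers only a handful of one-byte x86 opcodes.

```python
from collections import Counter

# Illustrative subset of x86 one-byte opcodes: byte -> (mnemonic, total length).
# A real disassembler needs the full instruction encoding; this table is a toy.
TOY_TABLE = {
    0x55: ("push", 1),   # push ebp
    0x5D: ("pop", 1),    # pop ebp
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xCC: ("int3", 1),
    0xB8: ("mov", 5),    # mov eax, imm32
    0xE8: ("call", 5),   # call rel32
    0xE9: ("jmp", 5),    # jmp rel32
    0xEB: ("jmp", 2),    # jmp rel8
}


def linear_sweep(code):
    """Decode bytes strictly in address order, never following branches.

    Unknown bytes are emitted as the placeholder 'db' and skipped one byte
    at a time, so the sweep always finishes (hypothetical tolerance policy).
    """
    opcodes = []
    offset = 0
    while offset < len(code):
        mnemonic, length = TOY_TABLE.get(code[offset], ("db", 1))
        opcodes.append(mnemonic)
        offset += length
    return opcodes


def opcode_ngrams(opcodes, n=2):
    """Count sliding-window n-grams over the OpCode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


# push ebp; mov eax, 1; <undecodable byte>; pop ebp; ret
sample = bytes.fromhex("55 b8 01 00 00 00 ff 5d c3")
sequence = linear_sweep(sample)
features = opcode_ngrams(sequence, n=2)
print(sequence)
print(features)
```

The sweep never fails on the undecodable byte; it records a placeholder and moves on, which is the trade of accuracy for coverage that the abstract describes. The resulting n-gram counts could then be vectorized and passed to any standard classifier.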
