Data augmentation and transfer learning to classify malware images in a deep learning context

In the past few years, malware classification techniques have shifted from shallow traditional machine learning models to deeper neural network architectures. A main benefit of some of these architectures is their ability to work with raw data, enabled by automatic feature extraction. As a result, building the models requires less technical expertise and fewer initial pre-processing resources. This advantage comes with a drawback, however: deep learning models require huge quantities of data to generalize well, and the amount of data needed to train a deep network without overfitting is often unobtainable for malware analysts. We take inspiration from image-based data augmentation techniques and apply a sequence of semantics-preserving syntactic code transformations (obfuscations) to a small dataset of programs to generate a larger dataset. We then design two learning models, a convolutional neural network and a bidirectional long short-term memory network, and train them on images extracted from compiled binaries of the newly generated dataset. Through transfer learning, we then reuse the features learned from the obfuscated binaries and train the models on two state-of-the-art malware datasets, each containing around 10 000 samples. Our models achieve up to 98.5% accuracy on the test set, which is on par with or better than present state-of-the-art approaches, thus validating the approach.
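
The abstract does not say how images are extracted from the compiled binaries. One common convention in the malware-image literature is to read each byte as a grayscale pixel and wrap the byte stream into rows of fixed width. The sketch below illustrates that convention only; the function name, the default width of 256, and the zero-padding of the last row are assumptions made for illustration, not the authors' procedure.

```python
from typing import List


def bytes_to_grayscale_rows(data: bytes, width: int = 256) -> List[List[int]]:
    """Wrap raw bytes into rows of `width` pixels, each pixel in 0..255.

    The row width and the zero-padding of the final row are assumptions
    for this sketch; the source does not specify either.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Number of zero bytes needed to complete the last row.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Usage with a hypothetical 10-byte "binary" and a row width of 4 pixels.
image = bytes_to_grayscale_rows(bytes(range(10)), width=4)
```

In practice the bytes would come from a compiled binary read in binary mode, and the resulting rows would be resized to the fixed input shape a network expects.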
