Deep Learning and Visualization for Identifying Malware Families

The growing threat of malware is increasingly difficult to ignore. In this paper, we use a malware feature image generation method to combine static analysis of malicious code with recurrent neural networks (RNNs) and convolutional neural networks (CNNs). By using an RNN, our method considers not only the original information of the malware but also the association between the original code and its timing characteristics; furthermore, the process reduces the dependence on malware category labels. We then use minhash to generate feature images from the fusion of the original codes and the predictive codes produced by the RNN. Finally, we train a CNN to classify the feature images. When we trained on very few samples (the ratio of training set size to validation set size was 1:30), we obtained an accuracy of over 92 percent. When we adjusted the ratio to 3:1, the accuracy exceeded 99.5 percent. As the confusion matrices show, our method performs well: the worst false positive rate across all malware families is 0.0147, and the average false positive rate is 0.0058.
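The minhash step can be illustrated with a minimal sketch. This is not the paper's implementation: the opcode tokens, the n-gram size, the number of hash functions, the 8x8 image size, the use of seeded BLAKE2b as the hash family, and fusion by simple concatenation are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import hashlib


def ngrams(tokens, n=3):
    """Return the set of contiguous n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def seeded_hash(gram, seed):
    """Deterministic 64-bit hash of an n-gram; each seed acts as one hash function."""
    data = (str(seed) + "|" + " ".join(gram)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64, n=3):
    """For each hash function, keep the minimum hash value over all n-grams."""
    grams = ngrams(tokens, n)
    if not grams:
        raise ValueError("token sequence is shorter than n")
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def signature_to_image(signature, side=8):
    """Reshape a signature into a side x side grayscale image (pixel values 0..255)."""
    if len(signature) != side * side:
        raise ValueError("signature length must equal side * side")
    pixels = [value % 256 for value in signature]  # assumption: keep the low byte
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical opcode sequences: one from static analysis, one standing in
# for the RNN's predicted continuation.
original_codes = ["push", "mov", "sub", "call", "mov", "add", "ret"]
predicted_codes = ["push", "mov", "call", "pop", "ret"]

# Assumption: the two sequences are fused by concatenation.
fused_codes = original_codes + predicted_codes
feature_image = signature_to_image(minhash_signature(fused_codes))
```

Because two samples with similar n-gram sets agree at many signature positions, their images share many pixel values, which is the property that would let an image classifier such as a CNN group variants of one family.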
