Malware Detection on Byte Streams of PDF Files Using Convolutional Neural Networks

With increasing amount of data, the threat of malware keeps growing recently. The malicious actions embedded in nonexecutable documents especially (e.g., PDF files) can be more dangerous, because it is difficult to detect and most users are not aware of such type of malicious attacks. In this paper, we design a convolutional neural network to tackle the malware detection on the PDF files. We collect malicious and benign PDF files and manually label the byte sequences within the files. We intensively examine the structure of the input data and illustrate how we design the proposed network based on the characteristics of data. The proposed network is designed to interpret high-level patterns among collectable spatial clues, thereby predicting whether the given byte sequence has malicious actions or not. By experimental results, we demonstrate that the proposed network outperform several representative machine-learning models as well as other networks with different settings.

[1]  Brian Mac Namee,et al.  Deep learning at the shallow end: Malware classification for non-domain experts , 2018, Digit. Investig..

[2]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[6]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[7]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[8]  W. Marsden I and J , 2012 .

[9]  Claudia Eckert,et al.  Deep Learning for Classification of Malware System Call Sequences , 2016, Australasian Conference on Artificial Intelligence.

[10]  Wenyi Huang,et al.  MtNet: A Multi-Task Neural Network for Dynamic Malware Classification , 2016, DIMVA.

[11]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  J. Platt Sequential Minimal Optimization : A Fast Algorithm for Training Support Vector Machines , 1998 .

[13]  Danna Zhou,et al.  d. , 1934, Microbial pathogenesis.

[14]  Pavel Laskov,et al.  Hidost: a static machine-learning-based detector of malicious files , 2016, EURASIP J. Inf. Secur..

[15]  Chao Liu,et al.  FEPDF: A Robust Feature Extractor for Malicious PDF Detection , 2017, 2017 IEEE Trustcom/BigDataSE/ICESS.

[16]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17]  Jason Zhang,et al.  MLPdf: An Effective Machine Learning Based Approach for PDF Malware Detection , 2018, ArXiv.

[18]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[19]  Aliénor Damien,et al.  Malware Detection in PDF Files using Machine Learning , 2018, ICETE.

[20]  Ali Hadi,et al.  PDF Forensic Analysis System using YARA , 2017 .

[21]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Youssef Iraqi,et al.  Investigating the Impact of Possession-Way of a Smartphone on Action Recognition † , 2016, Sensors.

[23]  Jon Barker,et al.  Malware Detection by Eating a Whole EXE , 2017, AAAI Workshops.

[24]  Angelos Stavrou,et al.  Detecting Malicious Javascript in PDF through Document Instrumentation , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[25]  Angelos Stavrou,et al.  Malicious PDF detection using metadata and structural features , 2012, ACSAC '12.

[26]  Long Tang,et al.  Instantaneous Real-Time Kinematic Decimeter-Level Positioning with BeiDou Triple-Frequency Signals over Medium Baselines , 2015, Sensors.

[27]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[28]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[29]  Dana Kulic,et al.  Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks , 2017, ICMI.

[30]  Konstantin Berlin,et al.  Deep neural network based malware detection using two dimensional binary program features , 2015, 2015 10th International Conference on Malicious and Unwanted Software (MALWARE).

[31]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ross B. Girshick,et al.  Reducing Overfitting in Deep Networks by Decorrelating Representations , 2015, ICLR.

[33]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[34]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[35]  Edward Raff,et al.  Learning the PE Header, Malware Detection with Minimal Domain Knowledge , 2017, AISec@CCS.