Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection

Machine learning (ML) used for static portable executable (PE) malware detection typically employs per-file numerical feature vector representations as input with one or more target labels during training. However, there is much orthogonal information that can be gleaned from the \textit{context} in which the file was seen. In this paper, we propose utilizing a static source of contextual information -- the path of the PE file -- as an auxiliary input to the classifier. While file paths are not malicious or benign in and of themselves, they do provide valuable context for a malicious/benign determination. Unlike dynamic contextual information, file paths are available with little overhead and can seamlessly be integrated into a multi-view static ML detector, yielding higher detection rates at very high throughput with minimal infrastructural changes. Here we propose a multi-view neural network, which takes feature vectors from PE file content as well as corresponding file paths as inputs and outputs a detection score. To ensure realistic evaluation, we use a dataset of approximately 10 million samples -- files and file paths from user endpoints of an actual security vendor network. We then conduct an interpretability analysis via LIME modeling to ensure that our classifier has learned a sensible representation and see which parts of the file path most contributed to change in the classifier's score. We find that our model learns useful aspects of the file path for classification, while also learning artifacts from customers testing the vendor's product, e.g., by downloading a directory of malware samples each named as their hash. We prune these artifacts from our test dataset and demonstrate reductions in false negative rate of 32.3% at a $10^{-3}$ false positive rate (FPR) and 33.1% at $10^{-4}$ FPR, over a similar topology single input PE file content only model.

[1]  Philip K. Chan,et al.  Scalable Function Call Graph-based Malware Classification , 2017, CODASPY.

[2]  Mansour Ahmadi,et al.  Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification , 2015, CODASPY.

[3]  Vishal M. Patel,et al.  In2I: Unsupervised Multi-Image-to-Image Translation Using Generative Adversarial Networks , 2018, 2018 24th International Conference on Pattern Recognition (ICPR).

[4]  Subhransu Maji,et al.  Multi-view Convolutional Neural Networks for 3D Shape Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Thomas Vetter,et al.  Face Recognition Based on Fitting a 3D Morphable Model , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Mitchell Mays,et al.  Feature Selection for Malware Classification , 2017, MAICS.

[7]  Steffen Bickel,et al.  Multi-view clustering , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[8]  Richard E. Harang,et al.  MEADE: Towards a Malicious Email Attachment Detection Engine , 2018, 2018 IEEE International Symposium on Technologies for Homeland Security (HST).

[9]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[10]  Konstantin Berlin,et al.  Deep neural network based malware detection using two dimensional binary program features , 2015, 2015 10th International Conference on Malicious and Unwanted Software (MALWARE).

[11]  Barath Narayanan Narayanan,et al.  Performance analysis of machine learning and pattern recognition algorithms for Malware classification , 2016, 2016 IEEE National Aerospace and Electronics Conference (NAECON) and Ohio Innovation Summit (OIS).

[12]  Timothy F. Cootes,et al.  Active Appearance Models , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Mark Stamp,et al.  A comparison of static, dynamic, and hybrid analysis for malware detection , 2015, Journal of Computer Virology and Hacking Techniques.

[14]  Michael Isard,et al.  A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics , 2012, International Journal of Computer Vision.

[15]  Yang Liu,et al.  A multi-view context-aware approach to Android malware detection and malicious code localization , 2017, Empirical Software Engineering.

[16]  Hyrum S. Anderson,et al.  EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models , 2018, ArXiv.

[17]  Jon Barker,et al.  Malware Detection by Eating a Whole EXE , 2017, AAAI Workshops.

[18]  Rich Caruana,et al.  Multitask Learning: A Knowledge-Based Source of Inductive Bias , 1993, ICML.

[19]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[20]  Junfeng Wang,et al.  Improving malware detection using multi-view ensemble learning , 2016, Secur. Commun. Networks.

[21]  Terrance E. Boult,et al.  A Survey of Stealth Malware Attacks, Mitigation Measures, and Steps Toward Autonomous Open World Solutions , 2016, IEEE Communications Surveys & Tutorials.

[22]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[23]  Christopher Krügel,et al.  Limits of Static Analysis for Malware Detection , 2007, Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).

[24]  Wenyi Huang,et al.  MtNet: A Multi-Task Neural Network for Dynamic Malware Classification , 2016, DIMVA.

[25]  Richard E. Harang,et al.  ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation , 2019, USENIX Security Symposium.

[26]  Huashan Chen,et al.  Statistical Estimation of Malware Detection Metrics in the Absence of Ground Truth , 2018, IEEE Transactions on Information Forensics and Security.

[27]  Bo Wu,et al.  Fast rotation invariant multi-view face detection based on real Adaboost , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..

[28]  Henrik Aanæs,et al.  Large Scale Multi-view Stereopsis Evaluation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Konstantin Berlin,et al.  eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys , 2017, ArXiv.

[30]  Pascal Fua,et al.  Efficient large-scale multi-view stereo for ultra high-resolution image sets , 2011, Machine Vision and Applications.

[31]  Michael Carl Tschantz,et al.  Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels , 2015, AISec@CCS.

[32]  Harry Shum,et al.  Statistical Learning of Multi-view Face Detection , 2002, ECCV.

[33]  James Black,et al.  Multi view image surveillance and tracking , 2002, Workshop on Motion and Video Computing, 2002. Proceedings..

[34]  Terrance E. Boult,et al.  MOON: A Mixed Objective Optimization Network for the Recognition of Facial Attributes , 2016, ECCV.

[35]  Tyler Moore,et al.  Polymorphic Malware Detection Using Sequence Classification Methods , 2016, 2016 IEEE Security and Privacy Workshops (SPW).

[36]  Christopher Krügel,et al.  A survey on automated dynamic malware-analysis techniques and tools , 2012, CSUR.

[37]  Paul A. Viola,et al.  Fast Multi-view Face Detection , 2003 .

[38]  Li-Jia Li,et al.  Multi-view Face Detection Using Deep Convolutional Neural Networks , 2015, ICMR.

[39]  Edward Raff,et al.  Learning the PE Header, Malware Detection with Minimal Domain Knowledge , 2017, AISec@CCS.

[40]  Manaal Faruqui,et al.  Improving Vector Space Word Representations Using Multilingual Correlation , 2014, EACL.

[41]  Aníbal R. Figueiras-Vidal,et al.  Pattern classification with missing data: a review , 2010, Neural Computing and Applications.

[42]  Richard E. Harang,et al.  A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content , 2018, 2018 IEEE Security and Privacy Workshops (SPW).

[43]  Dacheng Tao,et al.  A Survey on Multi-view Learning , 2013, ArXiv.

[44]  Mahmood Yousefi-Azar,et al.  Autoencoder-based feature learning for cyber security applications , 2017, 2017 International Joint Conference on Neural Networks (IJCNN).

[45]  Mansour Ahmadi,et al.  Microsoft Malware Classification Challenge , 2018, ArXiv.

[46]  Philip K. Chan,et al.  Malware classification using static analysis based features , 2017, 2017 IEEE Symposium Series on Computational Intelligence (SSCI).