EIGER: automated IOC generation for accurate and interpretable endpoint malware detection

A malware signature including behavioral artifacts, namely Indicator of Compromise (IOC) plays an important role in security operations, such as endpoint detection and incident response. While building IOC enables us to detect malware efficiently and perform the incident analysis in a timely manner, it has not been fully-automated yet. To address this issue, there are two lines of promising approaches: regular expression-based signature generation and machine learning. However, each approach has a limitation in accuracy or interpretability, respectively. In this paper, we propose EIGER, a method to generate interpretable, and yet accurate IOCs from given malware traces. The key idea of EIGER is enumerate-then-optimize. That is, we enumerate representations of potential artifacts as candidates of IOCs. Then, we optimize the combination of these candidates to maximize the two essential properties, i.e., accuracy and interpretability, towards the generation of reliable IOCs. Through the experiment using 162K of malware samples collected over the five months, we evaluated the accuracy of EIGER-generated IOCs. We achieved a high True Positive Rate (TPR) of 91.98% and a very low False Positive Rate (FPR) of 0.97%. Interestingly, EIGER achieved FPR of less than 1% even when we use completely different dataset. Furthermore, we evaluated the interpretability of the IOCs generated by EIGER through a user study, in which we recruited 15 of professional security analysts working at a security operation center. The results allow us to conclude that our IOCs are as interpretable as manually-generated ones. These results demonstrate that EIGER is practical and deployable to the real-world security operations.

[1]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[2]  Peter Brass Advanced Data Structures , 2008 .

[3]  Andrew W. Moore,et al.  X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[4]  Ehab Al-Shaer,et al.  TTPDrill: Automatic and Accurate Extraction of Threat Actions from Unstructured Text of CTI Sources , 2017, ACSAC.

[5]  Cindy Eisner,et al.  Accurate Malware Detection by Extreme Abstraction , 2018, ACSAC.

[6]  S. Shapiro,et al.  An Analysis of Variance Test for Normality (Complete Samples) , 1965 .

[7]  Takeo Hariu,et al.  API Chaser: Anti-analysis Resistant Malware Analyzer , 2013, RAID.

[8]  Jure Leskovec,et al.  Interpretable Decision Sets: A Joint Framework for Description and Prediction , 2016, KDD.

[9]  Lorenzo Cavallaro,et al.  TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time , 2018, USENIX Security Symposium.

[10]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[11]  Herbert Bos,et al.  Prudent Practices for Designing Malware Experiments: Status Quo and Outlook , 2012, 2012 IEEE Symposium on Security and Privacy.

[12]  Konrad Rieck,et al.  DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket , 2014, NDSS.

[13]  Juan Caballero,et al.  AVclass: A Tool for Massive Malware Labeling , 2016, RAID.

[14]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[15]  Babak Rahbarinia,et al.  Real-Time Detection of Malware Downloads via Large-Scale URL->File->Machine Graph Mining , 2016, AsiaCCS.

[16]  N. Johnson The MITRE corporation , 1961, ACM National Meeting.

[17]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[18]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[19]  Roberto Perdisci,et al.  ExecScent: Mining for New C&C Domains in Live Networks with Adaptive Control Protocol Templates , 2013, USENIX Security Symposium.

[20]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[21]  Gianluca Stringhini,et al.  Marmite: Spreading Malicious File Reputation Through Download Graphs , 2017, ACSAC.

[22]  Gang Wang,et al.  LEMNA: Explaining Deep Learning based Security Applications , 2018, CCS.

[23]  Zhou Li,et al.  Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence , 2016, CCS.

[24]  Samir Khuller,et al.  The Budgeted Maximum Coverage Problem , 1999, Inf. Process. Lett..

[25]  Tudor Dumitras,et al.  ChainSmith: Automatically Learning the Semantics of Malicious Campaigns by Mining Threat Intelligence Reports , 2018, 2018 IEEE European Symposium on Security and Privacy (EuroS&P).

[26]  Zhou Li,et al.  Lens on the Endpoint: Hunting for Malicious Software Through Endpoint Data Analysis , 2017, RAID.

[27]  Alexander Pretschner,et al.  Robust and Effective Malware Detection Through Quantitative Data Flow Graph Metrics , 2015, DIMVA.

[28]  R. Blair,et al.  A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. , 1992 .

[29]  Sattar Hashemi,et al.  Malware detection based on mining API calls , 2010, SAC '10.

[30]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[31]  Dan S. Wallach,et al.  Denial of Service via Algorithmic Complexity Attacks , 2003, USENIX Security Symposium.

[32]  M. L. Fisher,et al.  An analysis of approximations for maximizing submodular set functions—I , 1978, Math. Program..

[33]  Somesh Jha,et al.  Synthesizing Near-Optimal Malware Specifications from Suspicious Behaviors , 2010, 2010 IEEE Symposium on Security and Privacy.

[34]  Franco Turini,et al.  A Survey of Methods for Explaining Black Box Models , 2018, ACM Comput. Surv..

[35]  Razvan Pascanu,et al.  Malware classification with recurrent networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36]  Hiromu Yakura,et al.  Malware Analysis of Imaged Binary Samples by Convolutional Neural Network with Attention Mechanism , 2018, CODASPY.

[37]  Carsten Willems,et al.  Learning and Classification of Malware Behavior , 2008, DIMVA.

[38]  Nick Feamster,et al.  Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces , 2010, NSDI.

[39]  Vahab S. Mirrokni,et al.  Maximizing Non-Monotone Submodular Functions , 2011, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[40]  Joris Kinable,et al.  Malware classification based on call graph clustering , 2010, Journal in Computer Virology.

[41]  Somesh Jha,et al.  Semantics-aware malware detection , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[42]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[43]  S. Sitharama Iyengar,et al.  A Survey on Malware Detection Using Data Mining Techniques , 2017, ACM Comput. Surv..

[44]  Benjamin Livshits,et al.  Kizzle: A Signature Compiler for Detecting Exploit Kits , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[45]  Khaled Yakdan,et al.  Helping Johnny to Analyze Malware: A Usability-Optimized Decompiler and Malware Analysis User Study , 2016, 2016 IEEE Symposium on Security and Privacy (SP).