Probabilistic analysis of dynamic malware traces

Abstract We propose a method to automatically group unknown binaries executed in sandbox according to their interaction with system resources (files on the filesystem, mutexes, registry keys, network communication with remote servers and error messages generated by operating system) such that each group corresponds to a malware family. The method utilizes probabilistic generative model (Bernoulli mixture model), which allows human-friendly prioritization of identified clusters and extraction of readable behavioral indicators to maximize interpretability. We compare it to relevant prior art on a large set of malware binaries where a quality of cluster prioritization and automatic extraction of indicators of compromise is demonstrated. The proposed approach therefore implements complete pipeline which has the potential to significantly speed-up analysis of unknown samples.

[1]  Christopher Krügel,et al.  Scalable, Behavior-Based Malware Clustering , 2009, NDSS.

[2]  Wenke Lee,et al.  Ether: malware analysis via hardware virtualization extensions , 2008, CCS.

[3]  Anil K. Jain,et al.  Simultaneous feature selection and clustering using mixture models , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Mansour Ahmadi,et al.  Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification , 2015, CODASPY.

[5]  Wouter Joosen,et al.  Advanced or Not? A Comparative Study of the Use of Anti-debugging and Anti-VM Techniques in Generic and Targeted Malware , 2016, SEC.

[6]  Jens Myrup Pedersen,et al.  An approach for detection and family classification of malware based on behavioral analysis , 2016, 2016 International Conference on Computing, Networking and Communications (ICNC).

[7]  Gérard Govaert,et al.  Block clustering with Bernoulli mixture models: Comparison of different approaches , 2008, Comput. Stat. Data Anal..

[8]  Carsten Willems,et al.  Learning and Classification of Malware Behavior , 2008, DIMVA.

[9]  Vinod Yegneswaran,et al.  Eureka: A Framework for Enabling Static Malware Analysis , 2008, ESORICS.

[10]  G. Wahba,et al.  Multivariate Bernoulli distribution , 2012, 1206.1874.

[11]  Guofei Gu,et al.  Shadow attacks: automatically evading system-call-behavior based malware detection , 2011, Journal in Computer Virology.

[12]  Jonathon T. Giffin,et al.  Impeding Malware Analysis Using Conditional Code Obfuscation , 2008, NDSS.

[13]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[14]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[15]  Daniel Müllner,et al.  fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python , 2013 .

[16]  Juan Caballero,et al.  AVclass: A Tool for Massive Malware Labeling , 2016, RAID.

[17]  William W. Cohen,et al.  A Comparison of String Metrics for Matching Names and Records , 2003 .

[18]  Jan Kohout,et al.  Using Behavioral Similarity for Botnet Command-and-Control Discovery , 2016, IEEE Intelligent Systems.

[19]  Alexander Pretschner,et al.  Malware detection with quantitative data flow graphs , 2014, AsiaCCS.

[20]  Richard W. Hamming,et al.  Error detecting and error correcting codes , 1950 .

[21]  Kieran McLaughlin,et al.  SVM Training Phase Reduction Using Dataset Feature Filtering for Malware Detection , 2013, IEEE Transactions on Information Forensics and Security.

[22]  Somesh Jha,et al.  Static Analysis of Executables to Detect Malicious Patterns , 2003, USENIX Security Symposium.

[23]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[24]  Julia Hirschberg,et al.  V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure , 2007, EMNLP.

[25]  Anil K. Jain Data clustering: 50 years beyond K-means , 2010, Pattern Recognit. Lett..

[26]  Dani Gamerman,et al.  Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference , 1997 .

[27]  Yixin Chen,et al.  CLUE: cluster-based retrieval of images by unsupervised learning , 2005, IEEE Transactions on Image Processing.

[28]  Muttukrishnan Rajarajan,et al.  Employing Program Semantics for Malware Detection , 2015, IEEE Transactions on Information Forensics and Security.

[29]  Jaume Amores,et al.  Multiple instance classification: Review, taxonomy and comparative study , 2013, Artif. Intell..

[30]  Ke Wang,et al.  Fileprints: identifying file types by n-gram analysis , 2005, Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop.

[31]  Christopher Krügel,et al.  AccessMiner: using system-centric models for malware protection , 2010, CCS '10.

[32]  Digit Oktavianto,et al.  Cuckoo Malware Analysis , 2013 .

[33]  Jan Kohout,et al.  Automatic discovery of web servers hosting similar applications , 2015, 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM).

[34]  Angelos D. Keromytis,et al.  Countering code-injection attacks with instruction-set randomization , 2003, CCS '03.

[35]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[36]  Mehryar Mohri,et al.  Algorithms for Learning Kernels Based on Centered Alignment , 2012, J. Mach. Learn. Res..

[37]  Tomás Pevný,et al.  Multiple instance learning for malware classification , 2017, Expert Syst. Appl..

[38]  Carsten Willems,et al.  Automatic analysis of malware behavior using machine learning , 2011, J. Comput. Secur..

[39]  Karl N. Levitt,et al.  MCF: a malicious code filter , 1995, Comput. Secur..

[40]  Christopher Krügel,et al.  Limits of Static Analysis for Malware Detection , 2007, Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).

[41]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[42]  Aziz Mohaisen,et al.  AMAL: High-fidelity, behavior-based automated malware analysis and classification , 2014, Comput. Secur..

[43]  Felix C. Freiling,et al.  Toward Automated Dynamic Malware Analysis Using CWSandbox , 2007, IEEE Secur. Priv..

[44]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[45]  Alfons Juan-Císcar,et al.  On the use of Bernoulli mixture models for text classification , 2001, Pattern Recognit..