Large-Scale Identification of Malicious Singleton Files

We study a dataset of billions of program binary files that appeared on 100 million computers over the course of 12 months, discovering that 94% of these files were present on a single machine. Though malware polymorphism is one cause for the large number of singleton files, additional factors also contribute to polymorphism, given that the ratio of benign to malicious singleton files is 80:1. The huge number of benign singletons makes it challenging to reliably identify the minority of malicious singletons. We present a large-scale study of the properties, characteristics, and distribution of benign and malicious singleton files. We leverage the insights from this study to build a classifier based purely on static features to identify 92% of the remaining malicious singletons at a 1.4% percent false positive rate, despite heavy use of obfuscation and packing techniques by most malicious singleton files that we make no attempt to de-obfuscate. Finally, we demonstrate robustness of our classifier to important classes of automated evasion attacks.

[1]  Huan Liu,et al.  Chi2: feature selection and discretization of numeric attributes , 1995, Proceedings of 7th IEEE International Conference on Tools with Artificial Intelligence.

[2]  Tobias Scheffer,et al.  Stackelberg games for adversarial prediction problems , 2011, KDD.

[3]  Christopher Krügel,et al.  Effective and Efficient Malware Detection at the End Host , 2009, USENIX Security Symposium.

[4]  Barton P. Miller,et al.  Binary-code obfuscations in prevalent packer tools , 2013, CSUR.

[5]  References , 1971 .

[6]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Somesh Jha,et al.  Semantics-aware malware detection , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[8]  Johan A. K. Suykens,et al.  Least Squares Support Vector Machine Classifiers , 1999, Neural Processing Letters.

[9]  Andrew B. Whinston,et al.  A Game Theoretic Model and Empirical Analysis of Spammer Strategies , 2010 .

[10]  Christopher Meek,et al.  Adversarial learning , 2005, KDD '05.

[11]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[12]  Jack W. Stokes,et al.  Large-scale malware classification using random projections and neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13]  Pedro M. Domingos,et al.  Adversarial classification , 2004, KDD.

[14]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[15]  Roberto Perdisci,et al.  Scalable fine-grained behavioral clustering of HTTP-based malware , 2013, Comput. Networks.

[16]  Yevgeniy Vorobeychik,et al.  Optimal randomized classification in adversarial settings , 2014, AAMAS.

[17]  Christos Faloutsos,et al.  Polonium: Tera-Scale Graph Mining and Inference for Malware Detection , 2011 .

[18]  Duen Horng Chau,et al.  Guilt by association: large scale malware detection by mining file-relation graphs , 2014, KDD.

[19]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[20]  Kai Li,et al.  Efficient k-nearest neighbor graph construction for generic similarity measures , 2011, WWW.

[21]  Tudor Dumitras,et al.  Toward a standard benchmark for computer security research: the worldwide intelligence network environment (WINE) , 2011, BADGERS '11.

[22]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[23]  Tzi-cker Chiueh,et al.  A Study of the Packer Problem and Its Solutions , 2008, RAID.

[24]  Yevgeniy Vorobeychik,et al.  Scalable Optimization of Randomized Operational Decisions in Adversarial Classification Settings , 2015, AISTATS.

[25]  David Brumley,et al.  BitShred: feature hashing malware for scalable triage and semantic analysis , 2011, CCS '11.

[26]  Herbert Jaeger,et al.  Reservoir computing approaches to recurrent neural network training , 2009, Comput. Sci. Rev..

[27]  Somesh Jha,et al.  OmniUnpack: Fast, Generic, and Safe Unpacking of Malware , 2007, Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).

[28]  Kang G. Shin,et al.  MutantX-S: Scalable Malware Clustering Based on Static Features , 2013, USENIX Annual Technical Conference.

[29]  Tobias Scheffer,et al.  Nash Equilibria of Static Prediction Games , 2009, NIPS.

[30]  Qian Zhang,et al.  Stackelberg game for utility-based cooperative cognitiveradio networks , 2009, MobiHoc '09.

[31]  Wenke Lee,et al.  McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables , 2008, 2008 Annual Computer Security Applications Conference (ACSAC).

[32]  Yevgeniy Vorobeychik,et al.  Feature Cross-Substitution in Adversarial Classification , 2014, NIPS.