Large Scale Malware Analysis, Detection and Signature Generation

As the primary vehicle for most organized cybercrimes, malicious software (or malware) has become one of the most serious threats to computer systems and the Internet. With the recent advent of automated malware development toolkits, it has become relatively easy, even for marginally skilled adversaries, to create and mutate malware, bypassing Anti-Virus (AV) detection. This has led to a surge in the number of new malware threats and has created several major challenges for the AV industry. AV companies typically receive tens of thousands of suspicious samples daily. However, the overwhelming number of new malware easily overtax the available human resources at AV companies, making them less responsive to emerging threats and leading to poor detection rates. To address these issues, this dissertation proposes several new and scalable systems to facilitate malware analysis and detection, with the focus on a central theme: “automation and scalability”. This dissertation makes four primary contributions. First, it builds a large-scale malware database management system called SMIT that addresses the challenges of determining whether a suspicious sample is indeed malicious. SMIT exploits the insight that most new malicious samples are simple syntactic variations of existing malware. Thus, one way to ascertain the maliciousness of an unknown sample is to check if it is sufficiently similar to any existing malware. SMIT is designed to make such decisions efficiently using malware's function call graph—a high-level structural representation that is less susceptible to the low-level obfuscation employed by malware writers to evade detection. Second, the dissertation develops an automatic malware clustering system called MutantX. By quickly grouping similar samples into clusters, MutantX allows malware analysts to focus on representative samples and automatically generate labels based on samples' association with existing groups. Third, this dissertation introduces a signature-generation system, called Hancock, that automatically creates high-quality string signatures with extremely low false-positive rates. Finally, observing that two widely used malware analysis approaches—i.e., static and dynamic analyses—have their respective pros and cons, this dissertation proposes a novel system that optimally integrates static-feature and dynamic-behavior based malware clusterings, mitigating their respective shortcomings without losing their merits.

[1]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2003, ICTAI.

[2]  Yuchou Chang,et al.  Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm , 2008, Pattern Recognit..

[3]  Kaspar Riesen,et al.  Bipartite Graph Matching for Computing the Edit Distance of Graphs , 2007, GbRPR.

[4]  Edwin R. Hancock,et al.  Bayesian Graph Edit Distance , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Christopher Krügel,et al.  Scalable, Behavior-Based Malware Clustering , 2009, NDSS.

[6]  Helen J. Wang,et al.  Shield: vulnerability-driven network filters for preventing known vulnerability exploits , 2004, SIGCOMM 2004.

[7]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[8]  Carsten Willems,et al.  Automatic analysis of malware behavior using machine learning , 2011, J. Comput. Secur..

[9]  Wenke Lee,et al.  Classification of packed executables for accurate computer virus detection , 2008, Pattern Recognit. Lett..

[10]  Andrew Walenstein,et al.  Malware phylogeny generation using permutations of code , 2005, Journal in Computer Virology.

[11]  Christopher Krügel,et al.  Static Disassembly of Obfuscated Binaries , 2004, USENIX Security Symposium.

[12]  Dennis Shasha,et al.  Algorithmics and applications of tree and graph searching , 2002, PODS.

[13]  Xiaohua Hu,et al.  Cluster Ensemble and Its Applications in Gene Expression Analysis , 2004, APBC.

[14]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[15]  Carla E. Brodley,et al.  Solving cluster ensemble problems by bipartite graph partitioning , 2004, ICML.

[16]  William C. Arnold,et al.  AUTOMATICALLY GENERATED WIN32 HEURISTIC VIRUS DETECTION , 2000 .

[17]  Pavel Zezula,et al.  Similarity Search - The Metric Space Approach , 2005, Advances in Database Systems.

[18]  Jun Xu,et al.  Packet vaccine: black-box exploit detection and signature generation , 2006, CCS '06.

[19]  Jason Raber,et al.  Deobfuscator: An Automated Approach to the Identification and Removal of Code Obfuscation , 2007, 14th Working Conference on Reverse Engineering (WCRE 2007).

[20]  Julian R. Ullmann,et al.  An Algorithm for Subgraph Isomorphism , 1976, J. ACM.

[21]  Georg Wicherski,et al.  peHash: A Novel Approach to Fast Malware Clustering , 2009, LEET.

[22]  Zhenkai Liang,et al.  Fast and automated generation of attack signatures: a basis for building self-protecting servers , 2005, CCS '05.

[23]  Jignesh M. Patel,et al.  TALE: A Tool for Approximate Large Graph Matching , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[24]  Teofilo F. GONZALEZ,et al.  Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci..

[25]  Hamidah Ibrahim,et al.  A Survey: Clustering Ensembles Techniques , 2009 .

[26]  Alfred O. Hero,et al.  A binary linear programming formulation of the graph edit distance , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Peng Ning,et al.  Automatic diagnosis and response to memory corruption vulnerabilities , 2005, CCS '05.

[28]  Ana L. N. Fred,et al.  Data clustering using evidence accumulation , 2002, Object recognition supported by user interaction for service robots.

[29]  Ran El-Yaniv,et al.  On Prediction Using Variable Order Markov Models , 2004, J. Artif. Intell. Res..

[30]  James Newsome,et al.  Polygraph: automatically generating signatures for polymorphic worms , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[31]  Sergei Vassilvitskii,et al.  How slow is the k-means method? , 2006, SCG '06.

[32]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.

[33]  Gran Vía,et al.  GRAPHS, ENTROPY AND GRID COMPUTING: AUTOMATIC COMPARISON OF MALWARE , 2008 .

[34]  Somesh Jha,et al.  Static Analysis of Executables to Detect Malicious Patterns , 2003, USENIX Security Symposium.

[35]  Gregory R. Andrews,et al.  Binary Obfuscation Using Signals , 2007, USENIX Security Symposium.

[36]  Christopher Krügel,et al.  Efficient Detection of Split Personalities in Malware , 2010, NDSS.

[37]  Thomas Dullien,et al.  Graph-based comparison of Executable Objects , 2005 .

[38]  Kilian Q. Weinberger,et al.  Feature hashing for large scale multitask learning , 2009, ICML '09.

[39]  Alberto Del Bimbo,et al.  Efficient Matching and Indexing of Graph Models in Content-Based Retrieval , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[40]  Philip S. Yu,et al.  Substructure similarity search in graph databases , 2005, SIGMOD '05.

[41]  Kaizhong Zhang,et al.  Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems , 1989, SIAM J. Comput..

[42]  Horst Bunke,et al.  An Error-Tolerant Approximate Matching Algorithm for Attributed Planar Graphs and Its Application to Fingerprint Classification , 2004, SSPR/SPR.

[43]  Miguel Castro,et al.  Vigilante: end-to-end containment of internet worms , 2005, SOSP '05.

[44]  Saumya K. Debray,et al.  Obfuscation of executable code to improve resistance to static disassembly , 2003, CCS '03.

[45]  Philip S. Yu,et al.  Graph Indexing: Tree + Delta >= Graph , 2007, VLDB.

[46]  Somesh Jha,et al.  OmniUnpack: Fast, Generic, and Safe Unpacking of Malware , 2007, Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).

[47]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[48]  Saumya K. Debray,et al.  Deobfuscation: reverse engineering obfuscated code , 2005, 12th Working Conference on Reverse Engineering (WCRE'05).

[49]  Philip S. Yu,et al.  Graph indexing: a frequent structure-based approach , 2004, SIGMOD '04.

[50]  Halvar Flake,et al.  Structural Comparison of Executable Objects , 2004, DIMVA.

[51]  Aristides Gionis,et al.  Clustering aggregation , 2005, 21st International Conference on Data Engineering (ICDE'05).

[52]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[53]  James Newsome,et al.  Dynamic Taint Analysis for Automatic Detection, Analysis, and SignatureGeneration of Exploits on Commodity Software , 2005, NDSS.

[54]  Helen J. Wang,et al.  ShieldGen: Automatic Data Patch Generation for Unknown Vulnerabilities with Informed Probing , 2007, 2007 IEEE Symposium on Security and Privacy (SP '07).

[55]  Carsten Willems,et al.  Learning and Classification of Malware Behavior , 2008, DIMVA.

[56]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[57]  Ambuj K. Singh,et al.  Closure-Tree: An Index Structure for Graph Queries , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[58]  Tzi-cker Chiueh,et al.  Content-Based Image Indexing , 1994, VLDB.

[59]  Jon Crowcroft,et al.  Honeycomb , 2004, Comput. Commun. Rev..

[60]  Yong Chen,et al.  Automatic malware categorization using cluster ensemble , 2010, KDD.

[61]  George Varghese,et al.  Automated Worm Fingerprinting , 2004, OSDI.

[62]  Tzi-cker Chiueh,et al.  A Study of the Packer Problem and Its Solutions , 2008, RAID.

[63]  Christopher Krügel,et al.  Polymorphic Worm Detection Using Structural Information of Executables , 2005, RAID.

[64]  Somesh Jha,et al.  Semantics-aware malware detection , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[65]  Yong Tang,et al.  Defending against Internet worms: a signature-based approach , 2005, Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies..

[66]  Hao Wang,et al.  Towards automatic generation of vulnerability-based signatures , 2006, 2006 IEEE Symposium on Security and Privacy (S&P'06).

[67]  Ana L. N. Fred,et al.  Finding Consistent Clusters in Data Partitions , 2001, Multiple Classifier Systems.

[68]  Ming-Yang Kao,et al.  Hamsa: fast signature generation for zero-day polymorphic worms with provable attack resilience , 2006, 2006 IEEE Symposium on Security and Privacy (S&P'06).

[69]  Anil K. Jain,et al.  Combining multiple weak clusterings , 2003, Third IEEE International Conference on Data Mining.

[70]  Barton P. Miller,et al.  Practical analysis of stripped binary code , 2005, CARN.

[71]  David Brumley,et al.  BitShred: Fast, Scalable Code Reuse Detection in Binary Code (CMU-CyLab-10-006) , 2007 .

[72]  Shai Ben-David,et al.  Measures of Clustering Quality: A Working Set of Axioms for Clustering , 2008, NIPS.

[73]  Eric Filiol,et al.  Behavioral detection of malware: from a survey towards an established taxonomy , 2008, Journal in Computer Virology.

[74]  H. Kuhn The Hungarian method for the assignment problem , 1955 .

[75]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[76]  B. Karp,et al.  Autograph: Toward Automated, Distributed Worm Signature Detection , 2004, USENIX Security Symposium.

[77]  Somesh Jha,et al.  An architecture for generating semantics-aware signatures , 2005 .

[78]  Monireh Abdoos,et al.  A New Efficient Approach in Clustering Ensembles , 2007, IDEAL.

[79]  Zhenkai Liang,et al.  Automatic generation of buffer overflow attack signatures: an approach based on program behavior models , 2005, 21st Annual Computer Security Applications Conference (ACSAC'05).

[80]  Peter N. Yianilos,et al.  Data structures and algorithms for nearest neighbor search in general metric spaces , 1993, SODA '93.