Supervised learning for provenance-similarity of binaries

Understanding, measuring, and leveraging the similarity of binaries (executable code) is a foundational challenge in software engineering. We present a notion of similarity based on provenance: two binaries are similar if they are compiled from the same (or very similar) source code with the same (or similar) compilers. Empirical evidence suggests that provenance-similarity accounts for a significant portion of the variation in existing binaries, particularly in malware. We propose classification as a means of detecting provenance-similarity and evaluate its applicability. We evaluate a variety of classifiers, together with different types of attributes and similarity labeling schemes, on two benchmarks derived from open-source software and malware, respectively. Our results are encouraging: they indicate that classification is a viable approach to automated provenance-similarity detection, and in particular a useful aid for malware analysts.
