Paper information - Supervised learning for provenance-similarity of binaries

Supervised learning for provenance-similarity of binaries

Understanding, measuring, and leveraging the similarity of binaries (executable code) is a foundational challenge in software engineering. We present a notion of similarity based on provenance -- two binaries are similar if they are compiled from the same (or very similar) source code with the same (or similar) compilers. Empirical evidence suggests that provenance-similarity accounts for a significant portion of variation in existing binaries, particularly in malware. We propose and evaluate the applicability of classification to detect provenance-similarity. We evaluate a variety of classifiers, and different types of attributes and similarity labeling schemes, on two benchmarks derived from open-source software and malware respectively. We present encouraging results indicating that classification is a viable approach for automated provenance-similarity detection, and as an aid for malware analysts in particular.

Sagar Chaki | Arie Gurfinkel | Cory F. Cohen
