Statistical Reconstruction of Class Hierarchies in Binaries

We address a fundamental problem in reverse engineering of object-oriented code: the reconstruction of a program's class hierarchy from its stripped binary. Existing approaches rely heavily on structural information that is not always available, e.g., calls to parent constructors. As a result, these approaches often leave gaps in the hierarchies they construct, or fail to construct them altogether. Our main insight is that behavioral information can be used to infer subclass/superclass relations, supplementing any missing structural information. Thus, we propose the first statistical approach for static reconstruction of class hierarchies based on behavioral similarity. We capture the behavior of each type using a statistical language model (SLM), define a metric for pairwise similarity between types based on the Kullback-Leibler divergence between their SLMs, and lift it to determine the most likely class hierarchy. We implemented our approach in a tool called ROCK and used it to automatically reconstruct the class hierarchies of several real-world stripped C++ binaries. Our results demonstrate that ROCK obtained significantly more accurate class hierarchies than those obtained using structural analysis alone.
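
The similarity step described above can be illustrated with a minimal sketch. It is not ROCK's implementation: it assumes each type's behavior is available as sequences of event names, swaps in an add-alpha smoothed bigram model for the SLM, and replaces the lifting step with a naive "closest other type" choice. All names (`bigram_model`, `kl_divergence`, `most_similar`, the example types and events) are hypothetical.

```python
import math
from collections import Counter


def bigram_model(sequences, vocab, alpha=1.0):
    """Add-alpha smoothed bigram probabilities P(b | a) over a fixed vocabulary.

    A simple stand-in for a statistical language model (an assumption,
    not the model used by the tool described in the abstract).
    """
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            prev_counts[a] += 1
    size = len(vocab)
    return {
        (a, b): (pair_counts[(a, b)] + alpha) / (prev_counts[a] + alpha * size)
        for a in vocab
        for b in vocab
    }


def kl_divergence(p, q):
    """Sum, over all contexts, of the KL divergence between next-event distributions.

    Smoothing keeps every probability positive, so the logarithm is defined.
    The measure is asymmetric: kl_divergence(p, q) != kl_divergence(q, p) in general.
    """
    return sum(pv * math.log(pv / q[key]) for key, pv in p.items())


def most_similar(name, models):
    """Return the other type whose model diverges least from `name`'s model."""
    others = [other for other in models if other != name]
    return min(others, key=lambda other: kl_divergence(models[name], models[other]))


# Hypothetical event traces per type (illustrative data only).
traces = {
    "Stream": [["ctor", "read", "write", "close"], ["ctor", "read", "close"]],
    "BufferedStream": [["ctor", "read", "write", "flush", "close"]],
    "Logger": [["ctor", "write", "write", "write"], ["ctor", "write", "flush"]],
}
vocab = sorted({event for seqs in traces.values() for seq in seqs for event in seq})
models = {name: bigram_model(seqs, vocab) for name, seqs in traces.items()}

for name in models:
    print(name, "is behaviorally closest to", most_similar(name, models))
```

Because the divergence is asymmetric, the direction in which it is evaluated matters when reading it as evidence for a subclass/superclass relation; the direction used here is an arbitrary choice. Picking the closest type independently for each type can also produce cycles, so a real reconstruction would need a global step that selects a consistent hierarchy.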
