Estimating types in binaries using predictive modeling

Reverse engineering is an important tool for mitigating vulnerabilities in binaries. Because much software is written in object-oriented languages, reverse engineering of object-oriented code is of critical importance. One of the major hurdles in reverse engineering binaries compiled from object-oriented code is the use of dynamic dispatch. In the absence of debug information, any dynamic dispatch may appear to jump to many possible targets, which poses a significant challenge to a reverse engineer trying to track the program flow. We present a novel technique that statically determines the likely targets of virtual function calls. Our technique uses object tracelets (statically constructed sequences of operations performed on an object) to capture potential runtime behaviors of the object. Our analysis automatically pre-labels some of the object tracelets by relying on instances where the type of an object is known. The resulting type-labeled tracelets are then used to train a statistical language model (SLM) for each type. We then apply the resulting ensemble of SLMs to unlabeled tracelets to generate a ranking of their most likely types, from which we deduce the likely targets of dynamic dispatches. We have implemented our technique and evaluated it on real-world C++ binaries. Our evaluation shows that when there are multiple alternative targets, our approach can drastically reduce the number of targets that a reverse engineer has to consider.
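
The train-then-rank step described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated here: it swaps in a simple add-one-smoothed bigram model for the SLM, and the event names (`call(vt+0)`, `read(+8)`), the type names, and the helpers `BigramModel` and `rank_types` are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-tracelet marker


class BigramModel:
    """Add-one-smoothed bigram model over tracelet events (a stand-in for an SLM)."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.pair_counts = defaultdict(Counter)

    def train(self, tracelets):
        # Count how often each event follows the previous one.
        for tracelet in tracelets:
            prev = START
            for event in tracelet:
                self.pair_counts[prev][event] += 1
                prev = event

    def log_prob(self, tracelet):
        # Log-probability of the whole tracelet under this type's model.
        total = 0.0
        prev = START
        for event in tracelet:
            counts = self.pair_counts.get(prev, Counter())
            numerator = counts[event] + 1
            denominator = sum(counts.values()) + len(self.vocab)
            total += math.log(numerator / denominator)
            prev = event
        return total


def rank_types(labeled, unlabeled):
    """Train one model per type, then rank types by likelihood of the unlabeled tracelet."""
    vocab = {e for ts in labeled.values() for t in ts for e in t} | set(unlabeled)
    models = {}
    for type_name, tracelets in labeled.items():
        model = BigramModel(vocab)
        model.train(tracelets)
        models[type_name] = model
    return sorted(models, key=lambda name: models[name].log_prob(unlabeled), reverse=True)


# Hypothetical type-labeled tracelets (events are made-up operations on an object).
labeled = {
    "Circle": [["call(vt+0)", "read(+8)", "call(vt+4)"], ["call(vt+0)", "read(+8)"]],
    "Logger": [["write(+4)", "call(vt+8)"], ["write(+4)", "write(+12)", "call(vt+8)"]],
}
ranking = rank_types(labeled, ["call(vt+0)", "read(+8)"])
print(ranking)
```

In this toy setting the top-ranked type would narrow the candidate targets of a virtual call on the unlabeled object to that type's methods; a real analysis would work on tracelets extracted from a binary and would likely need a stronger sequence model than a bigram.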
