Finding Semantically-Equivalent Binary Code By Synthesizing Adaptors

Independently developed codebases typically contain many segments of code that perform the same or closely related operations (semantic clones). Finding functionally equivalent segments enables applications such as replacing a segment with a more efficient or more secure alternative. Such related segments often have different interfaces, so some glue code (an adaptor) is needed to replace one with the other. We present an algorithm that searches for replaceable code segments at the function level by attempting to synthesize an adaptor between them from some family of adaptors; it terminates if it finds no possible adaptor. We implement our technique using (1) concrete adaptor enumeration based on Intel's Pin framework and (2) binary symbolic execution, and explore the relation between the size of the adaptor search space and total search time. We present examples of applying adaptor synthesis to improve the security and efficiency of binary functions, deobfuscate binary functions, and switch between binary implementations of RC4. For large-scale evaluation, we run adaptor synthesis on more than 13,000 function pairs from the Linux C library. Our results confirm that several instances of adaptably equivalent binary functions exist in real-world code, and suggest that these functions can be used to construct cleaner, less buggy, more efficient programs.
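The idea of searching a family of adaptors by concrete enumeration can be illustrated with a minimal sketch. It is not the implementation described above, which works on binaries through Pin and symbolic execution; it is a source-level toy in Python. The functions `target` and `inner`, the adaptor family (each argument of the inner function is bound to either a target argument or a small constant), and the bounded test set used in place of a real equivalence check are all assumptions made for illustration.

```python
from itertools import product

def target(x, y):
    # Hypothetical function whose call sites we would like to redirect.
    return x - y

def inner(a, b, c):
    # Hypothetical candidate replacement with a different interface.
    return c * (b - a)

def adaptor_family(n_target_args, n_inner_args, constants):
    # Assumed family: every inner argument is bound either to one of the
    # target's arguments or to one of a few constants.
    choices = [("arg", i) for i in range(n_target_args)]
    choices += [("const", k) for k in constants]
    return product(choices, repeat=n_inner_args)

def apply_adaptor(adaptor, args):
    # Build the inner function's argument tuple from the target's arguments.
    return tuple(args[v] if kind == "arg" else v for kind, v in adaptor)

def find_adaptor(target, inner, n_target_args, n_inner_args, constants, tests):
    # Enumerate the finite family; return the first adaptor under which the
    # two functions agree on every test input, or None if none does.
    for adaptor in adaptor_family(n_target_args, n_inner_args, constants):
        if all(target(*t) == inner(*apply_adaptor(adaptor, t)) for t in tests):
            return adaptor
    return None

# Bounded test inputs stand in for a real equivalence check (an assumption).
tests = list(product(range(-3, 4), repeat=2))
found = find_adaptor(target, inner, 2, 3, (0, 1), tests)
print(found)
```

Because the family is finite, the loop always terminates, either with an adaptor or with `None`. Agreement on a bounded test set is only evidence of equivalence, not a proof, and the number of candidates grows exponentially with the number of inner arguments, which is one way to see why search time depends on the size of the adaptor search space.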
