Using Reduced Execution Flow Graph to Identify Library Functions in Binary Code

Library function identification is a key technique in reverse engineering, and it faces two challenges: the discontinuity and the polymorphism of library functions. We introduce the Execution Flow Graph (EFG), a new hybrid of the dependence graph and the control flow graph, to describe the semantics of binary code. Library function identification then becomes a subgraph isomorphism testing problem, since the EFG of a library function instance is isomorphic to a sub-EFG of that library function. Because subgraph isomorphism detection is time-consuming, we introduce a further representation based on the EFG, the Reduced Execution Flow Graph (REFG), to speed up the isomorphism testing. We have proved that EFGs are subgraph isomorphic as long as their corresponding REFGs are subgraph isomorphic. The efficiency of the REFG approach comes from the fewer nodes and edges in REFGs and from new lossless filters that exclude unmatched subgraphs before detection. Experimental results show that the precision of both the EFG and REFG approaches is higher than that of the state-of-the-art tool, and that the REFG approach sharply reduces the processing time of the EFG approach while keeping precision and recall consistent.
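To make the subgraph isomorphism formulation concrete, the following is a minimal Python sketch of a labeled subgraph match preceded by cheap pre-filters. It is not the EFG or REFG algorithm from the paper. The graph encoding (a dict of node labels plus a set of directed edges), the use of instruction mnemonics as labels, the function names `passes_filters` and `find_embedding`, and the non-induced matching rule (every pattern edge must map to a target edge) are all assumptions made for illustration. The filters only check necessary conditions, so they never discard a true match, which is the sense in which a filter can be lossless.

```python
from collections import Counter


def passes_filters(pattern, target):
    """Cheap necessary conditions; failing any of them rules out a match."""
    (p_labels, p_edges), (t_labels, t_edges) = pattern, target
    # The pattern cannot have more nodes or edges than the target.
    if len(p_labels) > len(t_labels) or len(p_edges) > len(t_edges):
        return False
    # Each label must occur at least as often in the target as in the pattern.
    need, have = Counter(p_labels.values()), Counter(t_labels.values())
    return all(have[label] >= count for label, count in need.items())


def find_embedding(pattern, target):
    """Return a dict mapping pattern nodes to target nodes, or None."""
    p_labels, p_edges = pattern
    t_labels, t_edges = target
    if not passes_filters(pattern, target):
        return None

    order = list(p_labels)  # fixed order in which pattern nodes are assigned
    mapping, used = {}, set()

    def edges_ok():
        # Every pattern edge with both endpoints mapped must exist in the target.
        return all((mapping[a], mapping[b]) in t_edges
                   for a, b in p_edges if a in mapping and b in mapping)

    def extend(i):
        if i == len(order):
            return True
        p = order[i]
        for t in t_labels:
            if t in used or t_labels[t] != p_labels[p]:
                continue
            mapping[p] = t
            used.add(t)
            if edges_ok() and extend(i + 1):
                return True
            del mapping[p]  # backtrack
            used.discard(t)
        return False

    return dict(mapping) if extend(0) else None


# Hypothetical toy graphs: node -> mnemonic label, plus directed edges.
pattern = ({0: "mov", 1: "add", 2: "ret"}, {(0, 1), (1, 2)})
target = ({"a": "push", "b": "mov", "c": "add", "d": "ret"},
          {("a", "b"), ("b", "c"), ("c", "d")})
no_match = ({0: "mul", 1: "ret"}, {(0, 1)})

embedding = find_embedding(pattern, target)   # {0: 'b', 1: 'c', 2: 'd'}
rejected = find_embedding(no_match, target)   # None, rejected by the label filter
```

The backtracking search is exponential in the worst case, which is why reducing the number of nodes and edges and filtering out hopeless candidates before the search can pay off.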
