Upcompiling Legacy Code to Java

This thesis investigates “upcompilation”, the transformation of a binary program back into source code. Unlike decompilation, upcompilation produces code in a language with a higher level of abstraction than the one the program was originally written in. It thereby supports the migration of legacy applications whose source code is missing to a virtual machine. The result of the thesis is a deeper understanding of the problems that occur in upcompilers. To identify these problems, we wrote an upcompiler that transforms simple x86 binary programs into Java source code. We recover local variables, function arguments and return values from registers and memory. The expression reduction phase reduces the number of variables. We detect calls to library functions and translate memory allocation and basic input/output operations into Java constructs. The structuring phase transforms the control flow graph into an abstract syntax tree. We type the variables as integers and pointers to integers. To optimize the produced code for readability, we developed a data-flow-aware coalescing algorithm. The obstacles we discovered include type recovery, structuring, handling of obfuscated code, pointer representation in Java, and optimization for readability, to name only a few. For most of them we refer to related literature. We show that upcompilation is possible and identify where the problems lie. More investigation and implementation effort is needed to tackle specific problems and to make upcompilation applicable to real-world programs.
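To make the pointer representation problem concrete, the following is a minimal sketch of how a pointer to integer and a `malloc` call might be modelled in Java. It is an illustration only, not the output or the design of the upcompiler described here: the class `IntPointer`, its methods, and `sumOfSquares` are hypothetical names, and the sketch assumes 4-byte integers and a pointer represented as a backing array plus an element offset.

```java
// Hypothetical illustration: one possible Java model of a C "int *".
// None of these names come from the thesis; they are assumptions.
class UpcompiledSketch {

    /** A pointer to int, modelled as a backing array plus an element offset. */
    static final class IntPointer {
        final int[] base;   // the allocated block
        final int offset;   // index of the element pointed to

        IntPointer(int[] base, int offset) {
            this.base = base;
            this.offset = offset;
        }

        /** Stand-in for malloc(byteCount), assuming sizeof(int) == 4. */
        static IntPointer malloc(int byteCount) {
            return new IntPointer(new int[byteCount / 4], 0);
        }

        /** Pointer arithmetic: p + elements. */
        IntPointer add(int elements) {
            return new IntPointer(base, offset + elements);
        }

        /** Dereference for reading: *p. */
        int load() {
            return base[offset];
        }

        /** Dereference for writing: *p = value. */
        void store(int value) {
            base[offset] = value;
        }
    }

    /*
     * Java counterpart of a C fragment such as:
     *   int *p = malloc(n * sizeof(int));
     *   for (i = 0; i < n; i++) *(p + i) = i * i;
     *   for (i = 0; i < n; i++) sum += *(p + i);
     */
    static int sumOfSquares(int n) {
        IntPointer p = IntPointer.malloc(n * 4);
        for (int i = 0; i < n; i++) {
            p.add(i).store(i * i);
        }
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += p.add(i).load();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(4)); // prints 14 (0 + 1 + 4 + 9)
    }
}
```

The array-plus-offset model keeps pointer arithmetic inside one allocated block expressible, at the cost of allocating a small object per pointer operation. It does not cover pointers into the stack or casts between pointer types.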
