Java Decompiler Diversity and its Application to Meta-decompilation

During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code are not symmetric. Consequently, decompilation, which aims at producing source code from bytecode, relies on strategies to reconstruct the information that has been lost. Different Java decompilers use distinct strategies to achieve proper decompilation. In this work, we hypothesize that the diverse ways in which bytecode can be decompiled have a direct impact on the quality of the source code produced by decompilers. We assess the strategies of eight Java decompilers with respect to three quality indicators: syntactic correctness, syntactic distortion, and semantic equivalence modulo inputs. Our results show that no single modern decompiler is able to correctly handle the variety of bytecode structures coming from real-world programs. The highest-ranking decompiler in this study produces syntactically correct output for 84% of the classes in our dataset and semantically equivalent output for 78%. Our results also demonstrate that each decompiler correctly handles a different set of bytecode classes. We propose a new decompiler, called Arlecchino, that leverages the diversity of existing decompilers: it merges partial decompilations into a new one, guided by compilation errors. Arlecchino handles 37.6% of the bytecode classes that no individual decompiler previously handled. We publish the sources of this new bytecode decompiler.
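
To make the idea of merging partial decompilations more concrete, the following is a minimal sketch and not Arlecchino's actual implementation. It assumes, purely for illustration, that merging happens at method granularity and that each decompiler's output for a method has already been labelled as compiling or not. The names `MetaDecompilationSketch`, `Attempt`, and `merge` are hypothetical and do not come from the paper.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of meta-decompilation: for every method, keep the
// first decompiler output that is free of compilation errors.
final class MetaDecompilationSketch {

    // One decompiler's attempt at one method (assumed data model).
    static final class Attempt {
        final String decompiler;   // name of the decompiler that produced the text
        final String source;       // decompiled source text of the method
        final boolean compiles;    // assumed to be known from a prior compilation run

        Attempt(String decompiler, String source, boolean compiles) {
            this.decompiler = decompiler;
            this.source = source;
            this.compiles = compiles;
        }
    }

    // Maps each method signature to the first attempt that compiles, or to
    // Optional.empty() when every decompiler failed on that method.
    static Map<String, Optional<Attempt>> merge(Map<String, List<Attempt>> attemptsByMethod) {
        Map<String, Optional<Attempt>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<Attempt>> entry : attemptsByMethod.entrySet()) {
            Optional<Attempt> firstValid = entry.getValue().stream()
                    .filter(a -> a.compiles)
                    .findFirst();
            merged.put(entry.getKey(), firstValid);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Attempt>> attempts = new LinkedHashMap<>();
        attempts.put("foo()", Arrays.asList(
                new Attempt("A", "void foo() {", false),
                new Attempt("B", "void foo() {}", true)));
        attempts.put("bar()", Arrays.asList(
                new Attempt("A", "int bar() { return; }", false)));

        // Report which decompiler, if any, supplied each method.
        merge(attempts).forEach((method, chosen) ->
                System.out.println(method + " <- "
                        + chosen.map(a -> a.decompiler).orElse("no valid decompilation")));
    }
}
```

Under these assumptions, a class that no single decompiler handles in full can still be recovered whenever each of its methods is handled by at least one decompiler, which is the kind of diversity the abstract describes.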
