Optimizing Indirect Branches in Dynamic Binary Translators

Dynamic binary translation is a technology for transparently translating and modifying a program at the machine code level as it is running. A significant factor in the performance of a dynamic binary translator is its handling of indirect branches. Unlike direct branches, which have a known target at translation time, an indirect branch requires translating a source program counter address to a translated program counter address every time the branch is executed. This translation can impose a serious runtime penalty if it is not handled efficiently. MAMBO-X64, a dynamic binary translator that translates 32-bit ARM (AArch32) code to 64-bit ARM (AArch64) code, uses three novel techniques to improve the performance of indirect branch translation. Together, these techniques allow MAMBO-X64 to achieve a very low performance overhead of only 10% on average compared to native execution of 32-bit programs. Hardware-assisted function returns use a software return address stack to predict the targets of function returns, making use of several novel optimizations while also exploiting hardware return address prediction. This technique has a significant impact on most benchmarks, reducing binary translation overhead compared to native execution by 40% on average and by 90% on some benchmarks. Branch table inference, an algorithm for detecting and translating branch tables, can reduce the overhead of translated code by up to 40% on some SPEC CPU2006 benchmarks. The remaining indirect branches are handled using a fast atomic hash table, which is optimized to work with multiple threads. This last technique translates indirect branches using a single shared hash table while avoiding expensive synchronization in performance-critical lookup code. This allows the performance to be on par with thread-private hash tables while having superior memory scalability.
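As an illustration of the last technique, the following is a minimal sketch of a shared table whose lookup path takes no locks. It is not MAMBO-X64's actual implementation. It assumes that a 32-bit source program counter (SPC) and a 32-bit translated-code offset (TPC) can be packed into one 64-bit word, so that a single atomic load always observes a consistent pair. The names `ibt_insert`, `ibt_lookup`, `hash_spc` and the table size are hypothetical, as is the convention that a TPC of 0 means "not found".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical table size; a real translator would size or grow this. */
#define TABLE_BITS 10u
#define TABLE_SIZE (1u << TABLE_BITS)
#define EMPTY_SLOT UINT64_C(0)

/* Each slot packs (SPC << 32) | TPC into one 64-bit word, so readers
 * never see a half-written entry. Static storage starts zeroed (empty). */
static _Atomic uint64_t ibt_table[TABLE_SIZE];

/* Assumed hash: drop bit 0 (Thumb code is 2-byte aligned) and mask. */
static inline uint32_t hash_spc(uint32_t spc)
{
    return (spc >> 1) & (TABLE_SIZE - 1u);
}

/* Insert a mapping. Assumes tpc != 0 so a valid entry is never EMPTY_SLOT.
 * Writers race with a compare-and-swap instead of taking a lock. */
bool ibt_insert(uint32_t spc, uint32_t tpc)
{
    uint64_t entry = ((uint64_t)spc << 32) | tpc;
    uint32_t i = hash_spc(spc);

    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        uint64_t cur = atomic_load_explicit(&ibt_table[i], memory_order_acquire);
        if (cur == EMPTY_SLOT &&
            atomic_compare_exchange_strong_explicit(
                &ibt_table[i], &cur, entry,
                memory_order_acq_rel, memory_order_acquire)) {
            return true;                    /* we claimed the empty slot */
        }
        /* On CAS failure, cur now holds the value another thread stored. */
        if ((uint32_t)(cur >> 32) == spc) {
            return true;                    /* mapping already present */
        }
        i = (i + 1u) & (TABLE_SIZE - 1u);   /* linear probing */
    }
    return false;                           /* table full */
}

/* Lock-free lookup: one atomic load per probed slot.
 * Returns the TPC, or 0 if the SPC has not been translated yet. */
uint32_t ibt_lookup(uint32_t spc)
{
    uint32_t i = hash_spc(spc);

    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        uint64_t cur = atomic_load_explicit(&ibt_table[i], memory_order_acquire);
        if (cur == EMPTY_SLOT) {
            return 0;                       /* miss: fall back to the translator */
        }
        if ((uint32_t)(cur >> 32) == spc) {
            return (uint32_t)cur;           /* hit: low half is the TPC */
        }
        i = (i + 1u) & (TABLE_SIZE - 1u);
    }
    return 0;
}
```

In this sketch, a translated indirect branch would call `ibt_lookup` with the source target address and jump to the returned location on a hit. On a miss, it would return to the translator, which translates the target and publishes the mapping with `ibt_insert`. Entries are never removed here. Handling deletion or a code cache flush would need extra care and is outside the scope of the sketch.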
