Optimising dynamic binary modification across 64-bit Arm microarchitectures

Trace generation is a common optimisation in most Dynamic Binary Modification (DBM) systems because traces improve locality and code layout. We describe an optimised code layout for traces and show how to adapt the runtime algorithm to generate it. In this way, we reduce the overhead on all the Arm systems evaluated, which cover 5 different microarchitectures. A major source of overhead for DBM systems comes from handling indirect branches. Indirect Branch Inlining (IBI) is a mechanism that attempts to avoid this overhead by using predictions about the target of the indirect branch. We analyse the behaviour of IBI, propose a new predictor, Trace Restricted IBI (TRIBI), and show how to optimise IBI given the new trace generation algorithm. Our evaluation shows a geometric mean overhead for SPEC CPU2006, compared to native execution, of 9% on a Cortex-A53 (in-order core) and, for out-of-order cores, 11% on an X-Gene-2, 10% on a Cortex-A57, 7% on a Cortex-A72 and 8% on a Cortex-A73. This is a reduction in overhead of between 30% and 50% compared to the publicly available DBM system MAMBO, and an even larger reduction compared to DynamoRIO. Using PARSEC 3.0, we evaluate the scalability across threads on the X-Gene-2 system (the server machine with the highest number of cores) and show a geometric mean overhead of between 6% and 8%.
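To make the general idea behind IBI concrete, the following is a minimal Python sketch of a single indirect branch site with one inlined prediction. It is an illustration only: it models neither TRIBI nor the implementation in MAMBO or DynamoRIO. The class name `IndirectBranchSite`, the dictionary standing in for the code cache lookup table, and the "predict the first target seen" policy are all assumptions made for this example.

```python
class IndirectBranchSite:
    """Hypothetical model of one indirect branch with a single inlined prediction."""

    def __init__(self, lookup_table):
        # Maps application target addresses to code cache addresses.
        # In a real DBM system this would be a hash table probed at run time.
        self.lookup_table = lookup_table
        self.predicted_target = None
        self.predicted_cache_addr = None
        self.inline_hits = 0
        self.slow_lookups = 0

    def dispatch(self, target):
        # Fast path: the inlined compare against the predicted target.
        if target == self.predicted_target:
            self.inline_hits += 1
            return self.predicted_cache_addr

        # Slow path: fall back to the general lookup.
        self.slow_lookups += 1
        cache_addr = self.lookup_table[target]

        # Assumed policy: record the first target seen as the prediction.
        if self.predicted_target is None:
            self.predicted_target = target
            self.predicted_cache_addr = cache_addr
        return cache_addr
```

In this sketch, an indirect branch that mostly jumps to one target takes the cheap compare on most executions, while a branch with many distinct targets keeps falling through to the lookup; the counters make that difference visible.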
