MAMBO: A Low-Overhead Dynamic Binary Modification Tool for ARM

As the ARM architecture expands beyond its traditional embedded domain, there is growing interest in dynamic binary modification (DBM) tools for the general-purpose multicore processors in the ARM family. Existing DBM tools for ARM introduce large overheads in the execution of applications. This article addresses two specific questions: (i) how to develop such DBM tools for the ARM architecture and (ii) whether new optimisations are plausible and needed. We describe the general design of MAMBO, a new DBM tool for ARM, which we release together with this publication, and introduce novel optimisations to handle indirect branches. In addition, we explore scenarios in which it may be possible to relax the transparency offered by DBM tools so that extra optimisations can be applied. These scenarios arise from analysing the most typical usages: for example, application binaries without handcrafted assembly. The performance evaluation shows that MAMBO introduces small overheads for SPEC CPU2006 and PARSEC 3.0 compared with the execution times of the unmodified programs: a geometric mean overhead of 28% on a Cortex-A9 and of 34% on a Cortex-A15 for CPU2006, and between 27% and 32%, depending on the number of threads, for PARSEC on a Cortex-A15.
