HyperMAMBO-X64: Using Virtualization to Support High-Performance Transparent Binary Translation

Current computer architectures, including ARM, MIPS, PowerPC, SPARC and x86, have evolved from 32-bit architectures into 64-bit ones. Computer architects often consider whether hardware support for a subset of the instruction set could be eliminated so as to reduce hardware complexity, which could improve performance, reduce power usage and accelerate processor development. This paper considers the scenario where we want to eliminate 32-bit hardware support from the ARMv8 architecture. Dynamic binary translation can be used for this purpose and generally comes in one of two forms: application-level translators, which translate a single user-mode process on top of a native operating system, and system-level translators, which translate an entire operating system and all of its processes. Application-level translators can have good performance but are not totally transparent; system-level translators may be 100% compatible, but their performance suffers. HyperMAMBO-X64 uses a new approach that gets the best of both worlds: the translator runs as an application under the hypervisor but can still react to the behavior of guest operating systems. It is completely transparent with regard to the virtualized system while delivering performance close to that of hardware execution. A key factor in the low overhead of HyperMAMBO-X64 is its deep integration with the virtualization and memory management features of ARMv8. These features are exploited to support caching of translations across multiple address spaces while ensuring that translated code remains consistent with the source instructions it is based on. We show how these attributes are achieved without sacrificing either performance or accuracy.
