Translating the ARM Neon and VFP instructions in a binary translator

Binary translation emulates one instruction set with another, on the same or a different platform. The technique is widely used in modern software. Vector and floating‐point instructions appear in many applications, including multimedia, graphics, and gaming. Although these instructions are usually simulated in software in a binary translator, it is important to support them in a way that uses the host single‐instruction, multiple‐data (SIMD) and floating‐point hardware efficiently during emulation. We report our design and implementation of the emulation of ARM Neon and vector floating point (VFP) instructions in the machine‐code‐to‐low‐level‐virtual‐machine (MC2LLVM) binary translator. The Neon and VFP instructions are first translated into carefully chosen sequences of LLVM intermediate representation (IR); the IR sequences are then optimized and translated into host native binary by the existing LLVM backend. Because MC2LLVM uses the vector and floating‐point types in LLVM IR, the generated host native binary can take full advantage of the vector and floating‐point functional units of the host machine, if present. To be fully compliant with the Neon and VFP instruction sets, all features are supported, including flush‐to‐zero mode, default not‐a‐number (NaN) mode, and floating‐point exceptions. The experimental results show that, compared with code generated by MC2LLVM without the Neon and VFP extensions, code generated with the extensions achieves an average speedup of 1.174× on the SPEC 2006 benchmark suites and 12.05× the floating‐point throughput in LINPACK. Furthermore, MC2LLVM is 3.36× faster than QEMU at processing Neon/VFP instructions. Copyright © 2016 John Wiley & Sons, Ltd.
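
To make the flush‐to‐zero and default NaN modes concrete, the following is a minimal scalar sketch in C of a four‐lane single‐precision vector add under those two modes. It is not the MC2LLVM implementation, which emits LLVM IR rather than C. The names `fp_mode`, `emulate_lane_add`, and `emulate_vadd_f32` are hypothetical. The sketch assumes that flushed values keep their sign and that the default NaN is the positive quiet NaN `0x7FC00000`, and it does not model floating‐point exception flags.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4

/* Hypothetical model of the guest floating-point control state. */
typedef struct {
    int flush_to_zero; /* treat subnormal inputs and results as signed zero */
    int default_nan;   /* replace any NaN result with one canonical quiet NaN */
} fp_mode;

/* Bit-level reinterpretation helpers (memcpy avoids aliasing problems). */
static uint32_t float_to_bits(float f) { uint32_t b; memcpy(&b, &f, sizeof b); return b; }
static float bits_to_float(uint32_t b) { float f; memcpy(&f, &b, sizeof f); return f; }

/* Subnormal: exponent field is zero and fraction field is nonzero. */
static int is_subnormal(float x) {
    uint32_t b = float_to_bits(x);
    return (b & 0x7F800000u) == 0 && (b & 0x007FFFFFu) != 0;
}

/* NaN: exponent field is all ones and fraction field is nonzero. */
static int is_nan(float x) {
    uint32_t b = float_to_bits(x);
    return (b & 0x7F800000u) == 0x7F800000u && (b & 0x007FFFFFu) != 0;
}

/* Replace a subnormal with a zero of the same sign (assumed behavior). */
static float flush_subnormal(float x) {
    return is_subnormal(x) ? bits_to_float(float_to_bits(x) & 0x80000000u) : x;
}

/* One lane of the add, with the mode handling wrapped around the host add. */
static float emulate_lane_add(float a, float b, fp_mode m) {
    if (m.flush_to_zero) {
        a = flush_subnormal(a);
        b = flush_subnormal(b);
    }
    float r = a + b; /* the host floating-point unit does the arithmetic */
    if (m.flush_to_zero)
        r = flush_subnormal(r);
    if (m.default_nan && is_nan(r))
        r = bits_to_float(0x7FC00000u); /* assumed default NaN encoding */
    return r;
}

/* Lane-wise add over a 4 x f32 vector. */
void emulate_vadd_f32(float dst[LANES], const float a[LANES],
                      const float b[LANES], fp_mode m) {
    for (int i = 0; i < LANES; i++)
        dst[i] = emulate_lane_add(a[i], b[i], m);
}
```

Calling `emulate_vadd_f32` with both modes enabled turns a subnormal lane into a signed zero and any NaN lane into the single canonical NaN; with both modes disabled the host result is passed through unchanged. A translator that targets vector types would express the same checks as whole‐vector compare and select operations instead of a per‐lane loop.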
