Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation

Single instruction, multiple data (SIMD) execution has been widely adopted for decades because of its performance and power efficiency. SIMD capability (i.e., width, number of registers, and advanced instructions) has diverged rapidly across SIMD instruction-set architectures (ISAs). As a result, migrating an existing application to a host ISA that has fewer but wider SIMD registers and more advanced instructions raises the problem of asymmetric SIMD capability. To date, this problem has been overlooked, leaving the host SIMD capability underutilized and performance suboptimal. In this article, we present a novel binary translation technique called spill-aware superword level parallelism (saSLP), which combines short ARMv8 instructions and registers in the guest binary to exploit the x86 AVX2 host’s parallelism, register capacity, and gather instructions. Our experimental results show that saSLP improves performance by 1.6× (2.3×) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 to x86 AVX2 (AVX-512) translation. Furthermore, with AVX2 (AVX-512) gather instructions, saSLP speeds up several data-irregular applications that cannot be vectorized on ARMv8 NEON by up to 3.9× (4.2×).
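To make the idea of combining short guest instructions concrete, the following is a minimal illustrative sketch, not the saSLP algorithm itself. It assumes that guest operations are 4-lane (128-bit NEON-style) vector ops, that adjacent ops with the same opcode are independent of each other, and that the host has 8 lanes (256-bit AVX2-style). The names `pack_pairs` and `execute`, and the tuple encoding of an operation, are hypothetical. Spill awareness and gather handling are not modeled.

```python
from typing import List, Tuple

GUEST_LANES = 4  # assumed: one 128-bit guest register holds four 32-bit lanes
HOST_LANES = 8   # assumed: one 256-bit host register holds eight 32-bit lanes

# Hypothetical encoding of a vector op: (opcode, source lanes a, source lanes b)
VecOp = Tuple[str, List[float], List[float]]


def pack_pairs(ops: List[VecOp]) -> List[VecOp]:
    """Greedily merge adjacent guest ops that share an opcode into one wider op.

    Assumes the two merged ops are independent (neither reads the other's result).
    """
    packed: List[VecOp] = []
    i = 0
    while i < len(ops):
        opcode, a, b = ops[i]
        if i + 1 < len(ops) and ops[i + 1][0] == opcode:
            _, a2, b2 = ops[i + 1]
            # Concatenate the lanes: two 4-lane ops become one 8-lane op.
            packed.append((opcode, a + a2, b + b2))
            i += 2
        else:
            # No partner available: emit the op at guest width.
            packed.append((opcode, a, b))
            i += 1
    return packed


def execute(opcode: str, a: List[float], b: List[float]) -> List[float]:
    """Lane-wise evaluation of a vector op, standing in for one SIMD instruction."""
    if opcode == "add":
        return [x + y for x, y in zip(a, b)]
    if opcode == "mul":
        return [x * y for x, y in zip(a, b)]
    raise ValueError(f"unsupported opcode: {opcode}")


guest_ops: List[VecOp] = [
    ("add", [1, 2, 3, 4], [10, 20, 30, 40]),
    ("add", [5, 6, 7, 8], [50, 60, 70, 80]),
    ("mul", [1, 1, 1, 1], [2, 2, 2, 2]),
]
host_ops = pack_pairs(guest_ops)
results = [execute(*op) for op in host_ops]
```

In this sketch the two guest `add` operations collapse into a single 8-lane host operation, while the unpaired `mul` stays at guest width. A real translator would additionally have to prove independence, map registers, and weigh the packing against register pressure.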
