Efficient and retargetable SIMD translation in a dynamic binary translator

The single-instruction, multiple-data (SIMD) capability of modern processors has been steadily improved to deliver better performance and power efficiency. For example, Intel has increased SIMD register lengths from 128 bits in the Streaming SIMD Extensions (SSE) to 512 bits in AVX-512. The ARM Scalable Vector Extension (SVE) supports SIMD register lengths of up to 2048 bits and includes predicated instructions. However, the translation of SIMD instructions in dynamic binary translation has not received similar attention. For example, the widely used QEMU emulates guest SIMD instructions with a sequence of scalar instructions, even when the host machine has corresponding SIMD instructions. This leaves significant room for performance improvement. We propose a new SIMD translation framework for dynamic binary translation that takes advantage of the host's SIMD capabilities. The proposed framework has been built in HQEMU, an enhanced QEMU with a separate thread for applying LLVM optimizations. The current prototype supports ARMv7, ARMv8, and IA32 guests on an X86-64 AVX2 host. Compared with the scalar-translation version of HQEMU, our framework runs up to 1.84 times faster on the SPEC CFP2006 benchmarks and up to 6.81 times faster on selected real applications.
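The difference between scalar emulation and SIMD-to-SIMD translation can be illustrated with a minimal C sketch. It is not taken from QEMU or HQEMU: the `guest_vreg` type and both function names are hypothetical, and the vector version relies on the GCC/Clang `vector_size` extension as a stand-in for whatever SIMD instructions a host back end would emit. The sketch assumes a 128-bit guest register holding four single-precision lanes.

```c
#include <string.h>

/* Hypothetical guest SIMD register: four 32-bit float lanes (128 bits). */
typedef struct { float lane[4]; } guest_vreg;

/* Host vector type via the GCC/Clang vector extension (an assumption of
 * this sketch; a real translator would emit host SIMD instructions). */
typedef float host_f32x4 __attribute__((vector_size(16)));

/* Scalar emulation: one guest SIMD add becomes four scalar adds. */
static void vadd_f32x4_scalar(guest_vreg *d, const guest_vreg *a,
                              const guest_vreg *b)
{
    for (int i = 0; i < 4; i++) {
        d->lane[i] = a->lane[i] + b->lane[i];
    }
}

/* SIMD translation: the same guest add maps to one host vector add. */
static void vadd_f32x4_vector(guest_vreg *d, const guest_vreg *a,
                              const guest_vreg *b)
{
    host_f32x4 va, vb, vd;
    memcpy(&va, a, sizeof va);   /* load guest lanes into a host vector */
    memcpy(&vb, b, sizeof vb);
    vd = va + vb;                /* element-wise add on all four lanes */
    memcpy(d, &vd, sizeof vd);   /* write the result back to guest state */
}
```

Both functions compute the same result; the second simply expresses the operation at vector granularity, which is the opportunity the abstract says scalar emulation leaves unused.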
