Dynamic re-vectorization of binary code

In many cases, applications are not optimized for the hardware on which they run. Several factors contribute to this unsatisfying situation, including legacy code, commercial code distributed in binary form, and deployment on compute farms. Backward compatibility of the ISA guarantees functionality only, not the best exploitation of the hardware. In this work, we focus on maximizing CPU efficiency for the SIMD extensions. We propose to automatically convert, at runtime, loops vectorized for an older version of the SIMD extension to a newer one. Our mechanism is lightweight: it does not include a vectorizer, but instead leverages the work a static vectorizer previously did. We show that many loops compiled for x86 SSE can be dynamically converted to the more recent and more powerful AVX. We also show how correctness is maintained with regard to challenges such as data dependences and reductions. We obtain speedups in line with those of a native compiler targeting AVX. The re-vectorizer is implemented inside a dynamic optimization platform; it is completely transparent to the user, does not require rewriting binaries, and operates during program execution.
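The data-dependence and reduction issues mentioned above can be illustrated with a small model. The sketch below is hypothetical and does not reproduce the re-vectorizer's actual mechanism. It emulates SIMD lanes with Python lists and assumes 32-bit elements, giving 4 lanes for 128-bit SSE and 8 lanes for 256-bit AVX. The loop `a[i] = a[i - d] + 1` with dependence distance `d` is made up for illustration, and the names `scalar_loop`, `vector_loop`, `can_widen`, and `lane_sum` are likewise illustrative only. If a static compiler legally vectorized such a loop for SSE, that only guarantees `d >= 4`. Doubling the vector width is safe only if `d >= 8`.

```python
# Hypothetical model, not the paper's implementation: SIMD lanes are
# emulated with Python lists so the legality argument can run anywhere.

SSE_LANES = 4  # assumption: 128-bit register / 32-bit elements
AVX_LANES = 8  # assumption: 256-bit register / 32-bit elements


def scalar_loop(a, d):
    """Reference semantics: a[i] = a[i - d] + 1 for i in [d, len(a))."""
    a = list(a)
    for i in range(d, len(a)):
        a[i] = a[i - d] + 1
    return a


def vector_loop(a, d, vf):
    """Emulate a vf-wide SIMD loop: each step loads vf elements, adds 1,
    then stores vf elements; a scalar epilogue handles the remainder."""
    a = list(a)
    i = d
    while i + vf <= len(a):
        chunk = [a[i - d + k] + 1 for k in range(vf)]  # vector load + add
        a[i:i + vf] = chunk                            # vector store
        i += vf
    for j in range(i, len(a)):                         # scalar epilogue
        a[j] = a[j - d] + 1
    return a


def can_widen(d, new_vf):
    """Widening is safe when no lane reads an element written by the same
    vector step, i.e. the dependence distance is at least the new width."""
    return d >= new_vf


def lane_sum(x, vf):
    """Sum reduction with vf partial accumulators, folded at the end."""
    partial = [0] * vf
    n_main = len(x) - len(x) % vf
    for i in range(0, n_main, vf):
        for k in range(vf):
            partial[k] += x[i + k]
    return sum(partial) + sum(x[n_main:])  # horizontal add + epilogue


if __name__ == "__main__":
    a = [0] * 32
    for d in (4, 8):
        ok = vector_loop(a, d, AVX_LANES) == scalar_loop(a, d)
        print(f"d={d}: can_widen={can_widen(d, AVX_LANES)}, matches={ok}")
```

With an all-zero 32-element array, `d = 4` keeps the 4-lane version correct but breaks the 8-lane version, while `d = 8` is safe at both widths. For the reduction, integer sums agree whether 4 or 8 partial accumulators are used. With floating-point data, the different association order can change the rounded result, because floating-point addition is not associative.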
