Effective exploitation of SIMD resources in cross-ISA virtualization

System virtualization is a fundamental technology that enables many important applications. However, existing virtualization techniques make limited use of host SIMD hardware resources, especially when a guest application has no inherent fine-grained data-level parallelism. To bridge this utilization gap and unleash the full potential of host SIMD resources, this paper proposes an effective and unconventional SIMD exploitation technique. The technique takes advantage of the ample host SIMD registers and powerful host SIMD instructions to generate more efficient host binary code for guest applications, even those without any fine-grained data-level parallelism. It also mitigates the shortage of general-purpose registers on the host platform and improves the efficiency of accessing guest registers. We have implemented the technique in QEMU, a widely used virtualization platform. Experimental results on a comprehensive list of benchmarks from PARSEC, SPEC CPU2017, and the Google Octane JavaScript benchmark suite show that an average speedup of 2.2X can be achieved for AArch64 binaries on an x86-64 host machine. We believe the proposed technique will provide a new perspective for our community to rethink the exploitation of SIMD hardware resources.
