Quantitative characterization of the software layer of a HW/SW co-designed processor

HW/SW co-designed processors are attracting renewed interest because they can boost performance without running into the power and complexity walls. These hybrid architectures employ a software layer that performs dynamic binary translation and applies aggressive optimizations that exploit runtime application behavior, yielding better performance per watt. However, a poorly designed software layer can incur translation and optimization overheads large enough to offset these benefits. This work presents a detailed characterization of the software layer of a HW/SW co-designed processor using a variety of benchmark suites. We observe that the performance of the software layer is very sensitive to the characteristics of the emulated application, with a variance of more than 50%. We also show that the interaction between the software layer and the emulated application, which share the microarchitectural resources, can affect performance by 0-20%. Finally, we identify key elements that should be investigated further to reduce the observed variations in performance. The paper provides critical insights for improving the software layer design.
