HW/SW co-designed processors: Challenges, design choices and a simulation infrastructure for evaluation

Improving single-thread performance is a key challenge in modern microprocessors, especially because the traditional approaches of increasing clock frequency and deepening pipelines cannot be pushed further due to power constraints. Therefore, researchers have been looking at unconventional architectures to boost single-thread performance without running into the power wall. HW/SW co-designed processors, such as Nvidia Denver, are emerging as a promising alternative. However, before they can become mainstream, HW/SW co-designed processors need to address some key challenges, such as startup delay, translation/optimization overhead, and providing high performance with simple hardware. A fundamental requirement for evaluating different design choices and trade-offs to meet these challenges is a simulation infrastructure. Unfortunately, no such infrastructure is available today. Building such an infrastructure itself poses significant challenges, as it encompasses the complexities of not only an architectural framework but also a compilation one. This paper identifies the key challenges that HW/SW co-designed processors face and the basic requirements for a simulation infrastructure targeting these architectures. Furthermore, the paper presents DARCO, a simulation infrastructure to enable research in this domain.
