Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Hardware specialization is an increasingly common technique to enable improved performance and energy efficiency in spite of the diminished benefits of technology scaling. This paper proposes a new approach called explicit loop specialization (XLOOPS) based on the idea of elegantly encoding inter-iteration loop dependence patterns in the instruction set. XLOOPS supports a variety of inter-iteration data-and control-dependence patterns for both single and nested loops. The XLOOPS hardware/software abstraction requires only lightweight changes to a general-purpose compiler to generate XLOOPS binaries and enables executing these binaries on: (1) traditional micro architectures with minimal performance impact, (2) specialized micro architectures to improve performance and/or energy efficiency, and (3) adaptive micro architectures that can seamlessly migrate loops between traditional and specialized execution to dynamically trade-off performance vs. Energy efficiency. We evaluate XLOOPS using a vertically integrated research methodology and show compelling performance and energy efficiency improvements compared to both simple and complex general-purpose processors.

[1]  Ken Kennedy,et al.  Practical dependence testing , 1991, PLDI '91.

[2]  Multiscalar processors , 1995, ISCA 1995.

[3]  J. Wawrzynek Spert-II : A Vector Micro Processore System, Special Issue of Neural Computing in , 1996 .

[4]  Brian Kingsbury,et al.  Spert-II: A Vector Microprocessor System , 1996, Computer.

[5]  Mateo Valero,et al.  Vector architectures: past, present and future , 1998, ICS '98.

[6]  Josep Torrellas,et al.  A Chip-Multiprocessor Architecture with Speculative Multithreading , 1999, IEEE Trans. Computers.

[7]  Antonia Zhai,et al.  A scalable approach to thread-level speculation , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[8]  C. Jesshope Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines , 2001, Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001.

[9]  Trevor Mudge,et al.  MiBench: A free, commercially representative embedded benchmark suite , 2001 .

[10]  Chris R. Jesshope Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines , 2001 .

[11]  Christoforos E. Kozyrakis,et al.  Scalable Vector Processors for Embedded Systems , 2003, IEEE Micro.

[12]  Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints , 2003, International Journal of Parallel Programming.

[13]  Jason Cong,et al.  Application-specific instruction generation for configurable processor architectures , 2004, FPGA '04.

[14]  Scott A. Mahlke,et al.  Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[15]  Christopher Batten,et al.  The vector-thread architecture , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[16]  Christopher J. Hughes,et al.  Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.

[17]  Scott A. Mahlke,et al.  Uncovering hidden loop level parallelism in sequential applications , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[18]  David Black-Schaffer,et al.  Efficient Embedded Computing , 2008, Computer.

[19]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[20]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[21]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22]  James R. Larus,et al.  Transactional Memory, 2nd edition , 2010, Transactional Memory.

[23]  Christoforos E. Kozyrakis,et al.  Flexible architectural support for fine-grain scheduling , 2010, ASPLOS XV.

[24]  Bradford M. Beckmann,et al.  The gem5 simulator , 2011, CARN.

[25]  Steven Swanson,et al.  QSCORES: Trading dark silicon for scalable energy efficiency with quasi-specific cores , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Karthikeyan Sankaralingam,et al.  Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[27]  Amin Ansari,et al.  Bundled execution of recurring traces for energy-efficient general purpose processing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[28]  AsanovićKrste,et al.  Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators , 2011 .

[29]  Emmett Kilgariff,et al.  Fermi GF100 GPU Architecture , 2011, IEEE Micro.

[30]  Karthikeyan Sankaralingam,et al.  DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing , 2012, IEEE Micro.

[31]  Guy E. Blelloch,et al.  Brief announcement: the problem based benchmark suite , 2012, SPAA '12.

[32]  Karthikeyan Sankaralingam,et al.  Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[33]  Gu-Yeon Wei,et al.  HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[34]  Willie Anderson,et al.  Hexagon DSP: An Architecture Optimized for Mobile Multimedia and Communications , 2014, IEEE Micro.

[35]  Christopher Batten,et al.  PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[36]  IEEE Micro , 2022 .