论文信息 - Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Hardware specialization is an increasingly common technique to enable improved performance and energy efficiency in spite of the diminished benefits of technology scaling. This paper proposes a new approach called explicit loop specialization (XLOOPS) based on the idea of elegantly encoding inter-iteration loop dependence patterns in the instruction set. XLOOPS supports a variety of inter-iteration data-and control-dependence patterns for both single and nested loops. The XLOOPS hardware/software abstraction requires only lightweight changes to a general-purpose compiler to generate XLOOPS binaries and enables executing these binaries on: (1) traditional micro architectures with minimal performance impact, (2) specialized micro architectures to improve performance and/or energy efficiency, and (3) adaptive micro architectures that can seamlessly migrate loops between traditional and specialized execution to dynamically trade-off performance vs. Energy efficiency. We evaluate XLOOPS using a vertically integrated research methodology and show compelling performance and energy efficiency improvements compared to both simple and complex general-purpose processors.

[1] Ken Kennedy,et al. Practical dependence testing , 1991, PLDI '91.

[2] Multiscalar processors , 1995, ISCA 1995.

[3] J. Wawrzynek. Spert-II : A Vector Micro Processore System, Special Issue of Neural Computing in , 1996 .

[4] Brian Kingsbury,et al. Spert-II: A Vector Microprocessor System , 1996, Computer.

[5] Mateo Valero,et al. Vector architectures: past, present and future , 1998, ICS '98.

[6] Josep Torrellas,et al. A Chip-Multiprocessor Architecture with Speculative Multithreading , 1999, IEEE Trans. Computers.

[7] Antonia Zhai,et al. A scalable approach to thread-level speculation , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[8] C. Jesshope. Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines , 2001, Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001.

[9] Trevor Mudge,et al. MiBench: A free, commercially representative embedded benchmark suite , 2001 .

[10] Chris R. Jesshope. Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines , 2001 .

[11] Christoforos E. Kozyrakis,et al. Scalable Vector Processors for Embedded Systems , 2003, IEEE Micro.

[12] Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints , 2003, International Journal of Parallel Programming.

[13] Jason Cong,et al. Application-specific instruction generation for configurable processor architectures , 2004, FPGA '04.

[14] Scott A. Mahlke,et al. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[15] Christopher Batten,et al. The vector-thread architecture , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[16] Christopher J. Hughes,et al. Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.

[17] Scott A. Mahlke,et al. Uncovering hidden loop level parallelism in sequential applications , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[18] David Black-Schaffer,et al. Efficient Embedded Computing , 2008, Computer.

[19] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[20] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009 .

[21] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22] James R. Larus,et al. Transactional Memory, 2nd edition , 2010, Transactional Memory.

[23] Christoforos E. Kozyrakis,et al. Flexible architectural support for fine-grain scheduling , 2010, ASPLOS XV.

[24] Bradford M. Beckmann,et al. The gem5 simulator , 2011, CARN.

[25] Steven Swanson,et al. QSCORES: Trading dark silicon for scalable energy efficiency with quasi-specific cores , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26] Karthikeyan Sankaralingam,et al. Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[27] Amin Ansari,et al. Bundled execution of recurring traces for energy-efficient general purpose processing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[28] AsanovićKrste,et al. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators , 2011 .

[29] Emmett Kilgariff,et al. Fermi GF100 GPU Architecture , 2011, IEEE Micro.

[30] Karthikeyan Sankaralingam,et al. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing , 2012, IEEE Micro.

[31] Guy E. Blelloch,et al. Brief announcement: the problem based benchmark suite , 2012, SPAA '12.

[32] Karthikeyan Sankaralingam,et al. Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[33] Gu-Yeon Wei,et al. HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[34] Willie Anderson,et al. Hexagon DSP: An Architecture Optimized for Mobile Multimedia and Communications , 2014, IEEE Micro.

[35] Christopher Batten,et al. PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[36] IEEE Micro , 2022 .