Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications

Chip multiprocessors with multiple simpler cores are gaining popularity because they have the potential to drive future performance gains without exacerbating the problems of power dissipation and complexity. Current chip multiprocessors increase throughput by utilizing multiple cores to perform computation in parallel. These designs provide real benefits for server-class applications that are explicitly multi-threaded. However, for desktop and other systems where single-thread applications dominate, multicore systems have yet to offer much benefit. Chip multiprocessors are most efficient at executing coarse-grain threads that have little communication. However, general-purpose applications do not provide many opportunities for identifying such threads, due to frequent use of pointers, recursive data structures, if-then-else branches, small function bodies, and loops with small trip counts. To attack this mismatch, this paper proposes a multicore architecture, referred to as Voltron that extends traditional multicore systems in two ways. First, it provides a dual-mode scalar operand network to enable efficient inter-core communication and lightweight synchronization. Second, Voltron can organize the cores for execution in either coupled or decoupled mode. In coupled mode, the cores execute multiple instruction streams in lock-step to collectively function as a wide-issue VLIW. In decoupled mode, the cores execute a set of fine-grain communicating threads extracted by the compiler. This paper describes the Voltron architecture and associated compiler support for orchestrating bi-modal execution

[1]  Scott A. Mahlke,et al.  A distributed control path architecture for VLIW processors , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[2]  Huiyang Zhou,et al.  Dual-core execution: building a highly scalable single-thread instruction window , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[3]  Microsystems Sun,et al.  Jini^ Architecture Specification Version 2.0 , 2003 .

[4]  Nathan L. Binkert,et al.  Network-Oriented Full-System Simulation using M5 , 2003 .

[5]  Henry Hoffmann,et al.  Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[6]  Ho-Seop Kim,et al.  An instruction set and microarchitecture for instruction level distributed processing , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[7]  Alexandre E. Eichenberger,et al.  Effective cluster assignment for modulo scheduling , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[8]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[9]  Antonio González,et al.  Modulo scheduling for a fully-distributed clustered VLIW architecture , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[10]  Robert P. Colwell,et al.  Architecture and implementation of a VLIW supercomputer , 1990, Proceedings SUPERCOMPUTING '90.

[11]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[12]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[13]  Hong-Seok Kim,et al.  Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis , 2004, SAS.

[14]  B. R. Rau,et al.  HPL-PD Architecture Specification:Version 1.1 , 2000 .

[15]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture , 2003, ISCA '03.

[16]  Andrew Wolfe,et al.  A variable instruction stream extension to the VLIW architecture , 1991, ASPLOS IV.

[17]  Anant Agarwal,et al.  Scalar operand networks: on-chip interconnect for ILP in partitioned architectures , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[18]  John R. Ellis,et al.  Bulldog: A Compiler for VLIW Architectures , 1986 .

[19]  Norman P. Jouppi,et al.  Processor Power Reduction Via Single-ISA , 2003 .

[20]  Lori Pollock,et al.  An experimental study of several cooperative register allocation and instruction scheduling strategies , 1995, MICRO 1995.

[21]  David L. Kuck,et al.  The Structure of Computers and Computations , 1978 .

[22]  William J. Dally,et al.  The J-machine Multicomputer: An Architectural Evaluation , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[23]  Joseph A. Fisher,et al.  Very long instruction work architectures and the ELI-512 , 1983, ISCA '98.

[24]  Scott A. Mahlke,et al.  Region-based hierarchical operation partitioning for multicluster processors , 2003, PLDI '03.

[25]  Antonio González,et al.  Modulo scheduling for a fully-distributed clustered VLIW architecture , 2000, MICRO 33.

[26]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[27]  Guilherme Ottoni,et al.  Automatic thread extraction with decoupled software pipelining , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[28]  Rainer Leupers,et al.  Code optimization techniques for embedded processors - methods, algorithms, and tools , 2000 .

[29]  Dr. Rainer Leupers Code Optimization Techniques for Embedded Processors , 2000, Springer US.

[30]  Thomas M. Conte,et al.  Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[31]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[32]  Scott A. Mahlke,et al.  Extracting Statistical Loop-Level Parallelism using Hardware-Assisted Recovery , 2007 .