Holistic design for multi-core architectures

Increasing design complexity and the diminishing marginal utility of monolithic processor designs have resulted in the integration of multiple loosely coupled processing cores on the same die. However, fundamental questions remain about the right form, implementation, and methodology for multi-core designs. This thesis addresses these questions. A popular methodology for designing a multi-core architecture is to replicate an off-the-shelf core design multiple times and then connect the cores using an interconnect mechanism. This methodology is "multi-core oblivious": the subsystems are designed and optimized without awareness of the overall chip-multiprocessing system they will become part of. This thesis demonstrates that the methodology is very inefficient in terms of area and power, and recommends a holistic approach in which the subsystems are designed from the ground up as components of a full system.

Inefficiency in "multi-core oblivious" designs arises at several levels. The first is core replication: a chip built from identical cores cannot adapt to the demands of the workloads it executes, so processor resources end up either underutilized or overutilized. This thesis proposes single-ISA (instruction-set architecture) heterogeneous multi-core architectures, in which the die hosts cores with varying power and performance characteristics that all run the same ISA. Such a processor can deliver significant power savings and performance improvements if applications are mapped to cores judiciously. The thesis also presents holistic design methodologies for such architectures.

Another source of inefficiency is the blind replication of over-provisioned hardware structures. To address it, the thesis proposes conjoined-core chip multiprocessing, in which adjacent cores of a multi-core architecture share some resources. The thesis shows that this can yield significant area savings without much performance degradation, and proposes novel optimizations that further reduce the already small degradation.

Yet another source of inefficiency is the interconnection. This thesis shows that interconnection overheads can be very significant for a "multi-core oblivious" design, especially as the number of cores increases and pipelines get deeper. The thesis demonstrates the need to co-design the cores, the memory, and the interconnection to avoid this inefficiency, and makes several suggestions regarding co-design.
