Factored multi-core architectures

Technology scaling trends have forced designers to consider alternatives to deeply pipelining aggressive cores that carry large amounts of performance-accelerating hardware. One alternative is to factor out, or decouple, large structures from critical pipeline loops. In this work, we combine prior factoring techniques into a cohesive framework and extend this paradigm to more of the processor core. We propose an architecture in which the large structures and latency-tolerant performance accelerators are factored out of the processor core into helpers, and the small, fast μ-core can be augmented with these helpers. This design reduces the number of accesses to the large, power-hungry structures and hence provides power savings with minimal impact on performance. It also allows helpers to use slower, more power-efficient circuit designs.
Because the demands placed on the processor core vary between applications, and even between phases of an application, the benefit from any set of helpers varies tremendously. With a single core, these auxiliary structures can be turned on and off dynamically to tune the energy/performance of the machine to the needs of the running application. This is achieved by taking advantage of the dynamic reconfigurability, or polymorphism, of helpers, which allows a core to adapt to changing applications, workloads, or phases. As more of the processor is broken down into helpers, and additional cores are added to a single chip, the μ-core design provides a unique opportunity to share helpers among the cores. This can be considered a middle point between two extremes: Simultaneous Multithreading (SMT), where most processor resources are shared among multiple threads running simultaneously, and Chip Multiprocessing (CMP), where each thread runs on a separate core, possibly sharing only the second-level on-chip cache.
With the opportunity to share helpers among multiple cores, the decisions made about these structures become increasingly important. Which resources should be shared, how should they be allocated, and how can their power be managed efficiently? To answer these questions at run time, we need a set of shared-helper management policies that adaptively allocate resources according to the needs of each executing workload. In this work we describe the need for methods that effectively manage these helpers. Along with selectively enabling different helpers, our techniques determine whether to assign exclusive or shared ownership of helpers across multiple cores. By intelligently correlating run-time observations with past measured benefits, our management schemes enable dynamic selection of customized configurations that increase performance and reduce power consumption.
Finally, we evaluate the thermal efficiency of the μ-core architecture. We investigate activity migration (core swapping) as a means of controlling the thermal profile of the chip. Specifically, the μ-core architecture presents an ideal platform for core swapping because helpers maintain the state of each process in a shared fabric surrounding the cores. This significantly reduces migration overhead, enabling seamless swapping of cores. Furthermore, we evaluate alternative ways of spending the area overhead of the additional μ-core, including larger μ-cores, CMP cores, and SMT cores running two-threaded workloads. We compare our design against other thermal management techniques such as global clock gating and frequency scaling.
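The idea of correlating run-time observations with past measured benefits can be illustrated with a minimal sketch. It is not the management scheme evaluated in this work. The `HelperManager` class, the integer phase identifier, the helper names, and the performance-per-watt score (`ipc / watts`) are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

# A configuration is the set of helpers that are enabled.
# Helper names are hypothetical labels, not taken from the source.
Config = FrozenSet[str]


@dataclass
class HelperManager:
    """Illustrative policy: remember how well each helper configuration
    performed in each program phase and reuse the best one."""

    configs: List[Config]
    # history[(phase_id, config)] -> (sum of scores, number of samples)
    history: Dict[Tuple[int, Config], Tuple[float, int]] = field(default_factory=dict)

    def record(self, phase_id: int, config: Config, ipc: float, watts: float) -> None:
        """Store one measurement. The score is performance per watt,
        which is an assumed metric chosen for this sketch."""
        score = ipc / watts
        total, count = self.history.get((phase_id, config), (0.0, 0))
        self.history[(phase_id, config)] = (total + score, count + 1)

    def choose(self, phase_id: int) -> Config:
        """Pick a configuration for the observed phase."""
        # First try any configuration not yet measured in this phase.
        for config in self.configs:
            if (phase_id, config) not in self.history:
                return config

        # Otherwise reuse the configuration with the best mean score.
        def mean_score(config: Config) -> float:
            total, count = self.history[(phase_id, config)]
            return total / count

        return max(self.configs, key=mean_score)


if __name__ == "__main__":
    no_helpers: Config = frozenset()
    with_prefetcher: Config = frozenset({"prefetcher"})
    manager = HelperManager([no_helpers, with_prefetcher])

    manager.record(1, no_helpers, ipc=1.0, watts=10.0)
    manager.record(1, with_prefetcher, ipc=1.5, watts=12.0)
    print(sorted(manager.choose(1)))
```

A policy for shared helpers would additionally have to arbitrate between cores (exclusive versus shared ownership); this sketch covers only the per-core selection step.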
