Over-provisioned multicore systems

Technology scaling has provided system designers with an exploding transistor budget, far more than what was available when the core principles behind many existing commodity microprocessors were envisioned. This tremendous growth also brings forth a whole new set of engineering challenges involving power density, thermal efficiency, and so on. In particular, the power constraint is becoming a first order design consideration in microprocessor designs. In the landscape of general purpose processors, power limited designs designate a significant paradigm shift from the area limited designs of the past. This dissertation proposes a model to capture the first order impact of the power constraint. Denoted as the Simultaneously Active Fraction (SAF), this metric represents the fraction of the entire chip resources that can be active simultaneously, given a target power envelope. As the improvement in the energy efficiency of individual transistor devices lags behind the growth in their integration capacity, the dissertation finds that the SAF is monotonically decreasing in each successive technology generation. In the context of rapidly shrinking SAF, this dissertation investigates a novel multicore design paradigm: Over-provisioned Multicore System (OPMS). An OPMS is a class of multicores that by design provision more processing core resources than that can be kept active for their target Thermal Design Power (TDP). Since only a subset of the on-chip cores are active at any given time, this design paradigm affords tremendous flexibility in assigning computation on processing cores, facilitating many novel techniques in this broad framework. To demonstrate a concrete application of this framework, the dissertation proposes Computation Spreading (CSP): a new model for distributing the collective work from multithreaded applications. CSP aims to collocate similar computation fragments from different threads on the same core, while distributing dissimilar computation fragments from the same thread across multiple cores. Under CSP, on-chip cores in an OPMS are dynamically specialized via retaining mutually exclusive predictive states. The dissertation demonstrates the effectiveness of CSP in an OPMS through a rigorous evaluation of performance, energy efficiency, and several design trade-offs.

[1]  Kaushik Roy,et al.  A fully physical model for leakage distribution under process variations in nanoscale double-gate CMOS , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[2]  Michael Gschwind,et al.  New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors , 2003, IBM J. Res. Dev..

[3]  T. Mudge,et al.  Drowsy caches: simple techniques for reducing leakage power , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[4]  Josep Torrellas,et al.  Benefits of cache-affinity scheduling in shared-memory multiprocessors: a summary , 1993, SIGMETRICS '93.

[5]  Uri C. Weiser,et al.  Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors , 2006, IEEE Computer Architecture Letters.

[6]  David J. DeWitt,et al.  DBMSs on a Modern Processor: Where Does Time Go? , 1999, VLDB.

[7]  Erich M. Nahum,et al.  Locality-aware request distribution in cluster-based network servers , 1998, ASPLOS VIII.

[8]  Kevin Skadron,et al.  CMP design space exploration subject to physical constraints , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[9]  G. Amdhal,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[10]  Babak Falsafi,et al.  Database Servers on Chip Multiprocessors: Limitations and Opportunities , 2007, CIDR.

[11]  Wolf-Dietrich Weber,et al.  Power provisioning for a warehouse-sized computer , 2007, ISCA '07.

[12]  James E. Smith,et al.  Virtual machines - versatile platforms for systems and processes , 2005 .

[13]  Jose Renau,et al.  Power model validation through thermal measurements , 2007, ISCA '07.

[14]  Kunihiro Suzuki Parasitic capacitance of submicrometer MOSFET's , 1999 .

[15]  Anant Agarwal,et al.  Scalar operand networks , 2005, IEEE Transactions on Parallel and Distributed Systems.

[16]  E. Fluhr,et al.  Design and Implementation of the POWER6 Microprocessor , 2008, IEEE Journal of Solid-State Circuits.

[17]  Barry Dennington Low Power Design from Technology Challenge to Great Products , 2006, ISLPED'06 Proceedings of the 2006 International Symposium on Low Power Electronics and Design.

[18]  Srihari Makineni,et al.  Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[19]  Dhabaleswar K. Panda,et al.  Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System , 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).

[20]  Margaret Martonosi,et al.  An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[21]  Lixin Zhang,et al.  Adaptive mechanisms and policies for managing cache hierarchies in chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[22]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[23]  Gil Neiger,et al.  Intel virtualization technology , 2005, Computer.

[24]  Patrik Larsson,et al.  di/dt Noise in CMOS Integrated Circuits , 1997 .

[25]  Jian Li,et al.  Dynamic power-performance adaptation of parallel computation on chip multiprocessors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[26]  S. Parekh,et al.  An analysis of database workload performance on simultaneous multithreaded processors , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[27]  James R. Larus,et al.  Using Cohort-Scheduling to Enhance Server Performance , 2002, USENIX Annual Technical Conference, General Track.

[28]  Josep Torrellas,et al.  Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[29]  S. Narendra,et al.  Modeling of parasitic capacitances in deep submicrometer conventional and high-K dielectric MOS transistors , 2003 .

[30]  Alan Jay Smith,et al.  Cache Memories , 1982, CSUR.

[31]  Kaushik Roy,et al.  Leakage Power Analysis and Reduction for Nanoscale Circuits , 2006, IEEE Micro.

[32]  Ramon Canal,et al.  Design space exploration for multicore architectures: a power/performance/thermal view , 2006, ICS '06.

[33]  Saibal Mukhopadhyay,et al.  Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits , 2003, Proc. IEEE.

[34]  Siva G. Narendra,et al.  Leakage in Nanometer CMOS Technologies , 2010 .

[35]  Josep Torrellas,et al.  Characterizing the caching and synchronization performance of a multiprocessor operating system , 1992, ASPLOS V.

[36]  E. You,et al.  A third-generation SPARC V9 64-b microprocessor , 2000, IEEE Journal of Solid-State Circuits.

[37]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[38]  J. E. Thornton,et al.  Parallel operation in the control data 6600 , 1964, AFIPS '64 (Fall, part II).

[39]  Shekhar Y. Borkar,et al.  Design challenges of technology scaling , 1999, IEEE Micro.

[40]  Aamer Jaleel,et al.  Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[41]  Mohamed I. Elmasry,et al.  Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[42]  Narayanan Vijaykrishnan,et al.  Understanding and improving operating system effects in control flow prediction , 2002, ASPLOS X.

[43]  Rami Melhem,et al.  The effects of energy management on reliability in real-time embedded systems , 2004, ICCAD 2004.

[44]  Marcelo Yuffe,et al.  The Implementation of the 65nm Dual-Core 64b Merom Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[45]  Pradip Bose,et al.  Stretching the limits of clock-gating efficiency in server-class processors , 2005, 11th International Symposium on High-Performance Computer Architecture.

[46]  Eby G. Friedman,et al.  Managing static leakage energy in microprocessor functional units , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[47]  Shekhar Y. Borkar,et al.  Microarchitecture and Design Challenges for Gigascale Integration , 2004, MICRO.

[48]  E. Alon,et al.  Digital Circuit Design Trends , 2008, IEEE Journal of Solid-State Circuits.

[49]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[50]  Meeta Sharma Gupta,et al.  System level analysis of fast, per-core DVFS using on-chip switching regulators , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[51]  Bradford M. Beckmann,et al.  Managing wire delay in chip multiprocessor caches , 2006 .

[52]  David E. Culler,et al.  SEDA: an architecture for well-conditioned, scalable internet services , 2001, SOSP.

[53]  S. Tam,et al.  A 65-nm Dual-Core Multithreaded Xeon® Processor With 16-MB L3 Cache , 2007, IEEE Journal of Solid-State Circuits.

[54]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[55]  Ravi Rajwar,et al.  The impact of performance asymmetry in emerging multicore architectures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[56]  Luiz André Barroso,et al.  Memory system characterization of commercial workloads , 1998, ISCA.

[57]  John L. Henning,et al.  Subroutine profiling results for the CPU2006 benchmarks , 2007, CARN.

[58]  Richard McDougall,et al.  Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture , 2006 .

[59]  Kevin Skadron,et al.  Temperature-Aware Microarchitecture: Extended Discussion and Results , 2003 .

[60]  T. N. Vijaykumar,et al.  Heat-and-run: leveraging SMT and CMP to manage power density through the operating system , 2004, ASPLOS XI.

[61]  Roland E. Wunderlich,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[62]  Zhen Yang,et al.  CMP cache performance projection: accessibility vs. capacity , 2007, CARN.

[63]  Huiyang Zhou,et al.  Dual-core execution: building a highly scalable single-thread instruction window , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[64]  Paul D. Franzon,et al.  Configurable string matching hardware for speeding up intrusion detection , 2005, CARN.

[65]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[66]  Anastasia Ailamaki,et al.  STEPS towards Cache-resident Transaction Processing , 2004, VLDB.

[67]  Michael D. Smith,et al.  An Analysis of Dynamic Branch Prediction Schemes on System Workloads , 1996, International Symposium on Computer Architecture.

[68]  John L. Henning Performance counters and development of SPEC CPU2006 , 2007, CARN.

[69]  Massoud Pedram,et al.  Charge recycling in MTCMOS circuits: concept and analysis , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[70]  Mark Horowitz,et al.  Cache performance of operating system and multiprogramming workloads , 1988, TOCS.

[71]  Gurindar S. Sohi,et al.  A static power model for architects , 2000, MICRO 33.

[72]  John L. Henning SPEC CPU2006 memory footprint , 2007, CARN.

[73]  David A. Wood,et al.  Variability in architectural simulations of multi-threaded workloads , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[74]  Koushik Chakraborty,et al.  Computation spreading: employing hardware migration to specialize CMP cores on-the-fly , 2006, ASPLOS XII.

[75]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[76]  Trevor N. Mudge,et al.  Power: A First-Class Architectural Design Constraint , 2001, Computer.

[77]  John Paul Shen,et al.  Scaling and characterizing database workloads: bridging the gap between research and practice , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[78]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[79]  Gurindar S. Sohi,et al.  Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[80]  Yuanyuan Zhou,et al.  Managing energy-performance tradeoffs for multithreaded applications on multiprocessor architectures , 2007, SIGMETRICS '07.

[81]  Carl A. Waldspurger,et al.  Memory resource management in VMware ESX server , 2002, OSDI '02.

[82]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[83]  Koushik Chakraborty,et al.  Adapting to intermittent faults in multicore systems , 2008, ASPLOS.

[84]  David R. Butenhof Programming with POSIX threads , 1993 .

[85]  Kevin Skadron,et al.  Low-Power Design and Temperature Management , 2007, IEEE Micro.

[86]  Norman P. Jouppi,et al.  Enterprise IT trends and implications for architecture research , 2005, 11th International Symposium on High-Performance Computer Architecture.

[87]  M. K. Gowan,et al.  A 65nm 2-Billion-Transistor Quad-Core Itanium® Processor , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[88]  David A. Wood,et al.  Full-system timing-first simulation , 2002, SIGMETRICS '02.

[89]  Susan J. Eggers,et al.  An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture , 2000, ASPLOS.

[90]  Jaehyuk Huh,et al.  Exploring the design space of future CMPs , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[91]  Brian N. Bershad,et al.  The interaction of architecture and operating system design , 1991, ASPLOS IV.

[92]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[93]  Gordon E. Moore,et al.  Progress in digital integrated electronics , 1975 .

[94]  Dominique Lavenier,et al.  SAMBA: hardware accelerator for biological sequence comparison , 1997, Comput. Appl. Biosci..

[95]  Shyamkumar Thoziyoor,et al.  1 CACTI 4 . 0 , 2006 .

[96]  Pradip Bose,et al.  Microarchitectural techniques for power gating of execution units , 2004, Proceedings of the 2004 International Symposium on Low Power Electronics and Design (IEEE Cat. No.04TH8758).

[97]  David Harris,et al.  CMOS VLSI Design: A Circuits and Systems Perspective , 2004 .

[98]  Lei He,et al.  Temperature-Aware Performance and Power Modeling , 2004 .

[99]  Mark D. Hill,et al.  Amdahl's Law in the Multicore Era , 2008 .

[100]  A. Kumar,et al.  Implementation of an 8-Core, 64-Thread, Power-Efficient SPARC Server on a Chip , 2008, IEEE Journal of Solid-State Circuits.

[101]  Alexandra Fedorova,et al.  Operating System Scheduling On Heterogeneous Core Systems , 2007 .

[102]  Paul Barford,et al.  Generating representative Web workloads for network and server performance evaluation , 1998, SIGMETRICS '98/PERFORMANCE '98.

[103]  Mark Horowitz,et al.  Scaling, Power and the Future of CMOS , 2007, 20th International Conference on VLSI Design held jointly with 6th International Conference on Embedded Systems (VLSID'07).

[104]  S. Borkar,et al.  Dynamic-sleep transistor and body bias for active leakage power control of microprocessors , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC..

[105]  M.D. Powell,et al.  Pipeline damping: a microarchitectural technique to reduce inductive noise in supply voltage , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[106]  Gaurav Mittal,et al.  Design of the Power6 Microprocessor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[107]  Scott A. Mahlke,et al.  VEAL: Virtualized Execution Accelerator for Loops , 2008, 2008 International Symposium on Computer Architecture.

[108]  Jichuan Chang,et al.  Cooperative cache partitioning for chip multiprocessors , 2007, ICS '07.