WiDGET: Wisconsin Decoupled Grid Execution Tiles

The recent paradigm shift to multi-core systems delivers high system throughput within a specified power budget. However, future systems still require good single-thread performance, even though it is no longer the predominant design priority, to mitigate sequential bottlenecks and/or to guarantee service-level agreements. Unfortunately, voltage scaling is nearing saturation, which necessitates a long-term alternative to dynamic voltage and frequency scaling. We propose an energy-proportional computing infrastructure, called WiDGET, that decouples thread context management from a sea of simple execution units (EUs). WiDGET's decoupled design provides the flexibility to alter resource allocation for a particular power-performance target while turning off unallocated resources. In other words, WiDGET enables dynamic customization of different combinations of small and/or powerful cores on a single chip, consuming power in proportion to the delivered performance. Across all SPEC CPU2006 benchmarks, WiDGET provides average per-thread performance that is 26% better than a Xeon-like processor while using 8% less power. WiDGET can also scale down to a level comparable to an Atom-like processor, turning off resources to reduce average power by 58%. WiDGET achieves high power efficiency (BIPS^3/W), exceeding Xeon-like and Atom-like processors by up to 2x and 21x, respectively.
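The power-efficiency metric named above, BIPS^3/W, cubes throughput (billions of instructions per second) before dividing by power, so it weights performance much more heavily than a plain performance-per-watt ratio. The following is a minimal sketch of how such a figure could be computed. The function names and all input numbers are hypothetical illustrations, not values or tooling from the paper.

```python
def bips3_per_watt(instructions: float, seconds: float, watts: float) -> float:
    """Return BIPS^3/W for one run (hypothetical helper, not from the paper)."""
    if seconds <= 0 or watts <= 0:
        raise ValueError("seconds and watts must be positive")
    bips = instructions / seconds / 1e9  # billions of instructions per second
    return bips ** 3 / watts


def relative_efficiency(perf_ratio: float, power_ratio: float) -> float:
    """Ratio of BIPS^3/W between two designs, given their performance and
    power ratios (design A divided by design B)."""
    if perf_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio ** 3 / power_ratio


# Made-up example: design A retires 2e9 instructions in 1 s at 10 W,
# design B retires 1e9 instructions in 1 s at 2 W.
a = bips3_per_watt(2e9, 1.0, 10.0)
b = bips3_per_watt(1e9, 1.0, 2.0)
print(a, b, a / b)
```

In this made-up case, design A draws five times the power for twice the throughput, yet it still scores higher on BIPS^3/W because the throughput term is cubed.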
