The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing

This article introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At microarchitectural level, McPAT includes models for the fundamental components of a complete chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, and integrated system components such as memory controllers and Ethernet controllers. At circuit level, McPAT supports detailed modeling of critical-path timing, area, and power. At technology level, McPAT models timing, area, and power for the device types forecast in the ITRS roadmap. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to accurately quantify the cost of new ideas and assess trade-offs of different architectures using new metrics such as Energy-Delay-Area2 Product (EDA2P) and Energy-Delay-Area Product (EDAP). This article explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting trade-offs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies from cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks for manycore designs at the 22nm technology shows that 8-core clustering gives the best energy-delay product, whereas when die area is taken into account, 4-core clustering gives the best EDA2P and EDAP.

[1]  Michael Gschwind,et al.  New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors , 2003, IBM J. Res. Dev..

[2]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[3]  Ronald G. Dreslinski,et al.  The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[4]  David A. Patterson,et al.  Computer Architecture - A Quantitative Approach (4. ed.) , 2007 .

[5]  William J. Dally,et al.  A 14mW 6.25Gb/s Transceiver in 90nm CMOS for Serial Chip-to-Chip Communications , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[6]  Goichi Ono,et al.  A 12.3mW 12.5Gb/s complete transceiver in 65nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[7]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[8]  Dezsö Sima,et al.  The Design Space of Register Renaming Techniques , 2000, IEEE Micro.

[9]  Norman P. Jouppi,et al.  WRL Research Report 93/5: An Enhanced Access and Cycle Time Model for On-chip Caches , 1994 .

[10]  David A. Patterson,et al.  Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .

[11]  Mohamed I. Elmasry,et al.  Dynamic and leakage power reduction in MTCMOS circuits using an automated efficient gate clustering technique , 2002, DAC '02.

[12]  Andrew B. Kahng,et al.  ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[13]  B. Bloechel,et al.  A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS , 2004, IEEE Journal of Solid-State Circuits.

[14]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[15]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[16]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[17]  Marcelo Yuffe,et al.  A fully integrated multi-CPU, GPU and memory controller 32nm processor , 2011, 2011 IEEE International Solid-State Circuits Conference.

[18]  Takayasu Sakurai,et al.  Analysis and future trend of short-circuit power , 2000, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[19]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[20]  A. Alvandpour,et al.  A six-port 57GB/s double-pumped nonblocking router core , 2005, Digest of Technical Papers. 2005 Symposium on VLSI Circuits, 2005..

[21]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[22]  A. Kumar,et al.  A 1.2 GHz Alpha microprocessor with 44.8 GB/s chip pin bandwidth , 2001, 2001 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. ISSCC (Cat. No.01CH37177).

[23]  Babak Falsafi,et al.  Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[24]  Hee-Tae Ahn,et al.  A low-jitter 1.9-V CMOS PLL for UltraSPARC microprocessor applications , 2000, IEEE Journal of Solid-State Circuits.

[25]  John L. Henning Performance counters and development of SPEC CPU2006 , 2007, CARN.

[26]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[27]  Vamsi Boppana,et al.  Accurate pre-layout estimation of standard cell characteristics , 2004, Proceedings. 41st Design Automation Conference, 2004..

[28]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[29]  R. Ho Chip Wires: Scaling and Efficiency , 2003 .

[30]  Larry L. Biro,et al.  Power considerations in the design of the Alpha 21264 microprocessor , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[31]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[32]  Shashank Gupta,et al.  Technology Independent Area and Delay Estimations for MicroprocessorBuilding Blocks , 2001 .

[33]  Ken Mai,et al.  The future of wires , 2001, Proc. IEEE.

[34]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[35]  Alan L. Cox,et al.  Achieving 10 Gb/s using safe and transparent network interface virtualization , 2009, VEE '09.

[36]  David A. Koufaty,et al.  Hyperthreading Technology in the Netburst Microarchitecture , 2003, IEEE Micro.

[37]  Lei He,et al.  Distributed sleep transistor network for power reduction , 2003, DAC '03.

[38]  Mark D. Hill,et al.  Virtual hierarchies to support server consolidation , 2007, ISCA '07.

[39]  Ying Zhang,et al.  A 4.0 GHz 291 Mb Voltage-Scalable SRAM Design in a 32 nm High-k + Metal-Gate CMOS Technology With Integrated Power Management , 2010, IEEE Journal of Solid-State Circuits.

[40]  S. Tam,et al.  A Dual-Core Multi-Threaded Xeon Processor with 16MB L3 Cache , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[41]  Jack L. Lo,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[42]  Goichi Ono,et al.  A 12.3-mW 12.5-Gb/s Complete Transceiver in 65-nm CMOS Process , 2010, IEEE Journal of Solid-State Circuits.

[43]  Ken Smits,et al.  Penryn: 45-nm next generation Intel® core™ 2 processor , 2007, 2007 IEEE Asian Solid-State Circuits Conference.

[44]  Timothy Johnson,et al.  An 8-core, 64-thread, 64-bit power efficient sparc soc (niagara2) , 2007, ISPD '07.

[45]  Krste Asanovic,et al.  Controlling program execution through binary instrumentation , 2005, CARN.

[46]  A. Kumar,et al.  Implementation of an 8-Core, 64-Thread, Power-Efficient SPARC Server on a Chip , 2008, IEEE Journal of Solid-State Circuits.

[47]  Mike Butler “Bulldozer” a new approach to mult ithreaded compute performance , 2010, 2010 IEEE Hot Chips 22 Symposium (HCS).

[48]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[49]  Rajesh Kumar,et al.  A family of 45nm IA processors , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[50]  Anantha Chandrakasan,et al.  MTCMOS hierarchical sizing based on mutual exclusive discharge patterns , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[51]  Mahmut T. Kandemir,et al.  Energy-driven integrated hardware-software optimizations using SimplePower , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[52]  Jung Ho Ahn,et al.  A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies , 2008, 2008 International Symposium on Computer Architecture.

[53]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[54]  Ha Pham,et al.  A 40nm 16-core 128-thread CMT SPARC SoC processor , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[55]  C. Auth,et al.  45nm High-k + metal gate strain-enhanced transistors , 2008, 2008 Symposium on VLSI Technology.

[56]  Chris Auth,et al.  45nm high-k + metal gate strain-enhanced CMOS transistors , 2008, 2008 IEEE Custom Integrated Circuits Conference.

[57]  W. C. Elmore The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers , 1948 .

[58]  Benjamin Bishop,et al.  The design of a register renaming unit , 1999, Proceedings Ninth Great Lakes Symposium on VLSI.

[59]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[60]  Mark Horowitz,et al.  Timing Models for MOS Circuits , 1983 .

[61]  Bruce Jacob,et al.  The structural simulation toolkit , 2006, PERV.

[62]  Dean M. Tullsen,et al.  Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[63]  Jung Ho Ahn,et al.  CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[64]  Norman P. Jouppi,et al.  Conjoined-Core Chip Multiprocessing , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[65]  Kevin Skadron,et al.  HotSpot: a compact thermal modeling methodology for early-stage VLSI design , 2006, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[66]  Alon Naveh,et al.  Power and Thermal Management in the Intel Core Duo Processor , 2006 .

[67]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[68]  Thomas Krause,et al.  A 12.5Gb/s SerDes in 65nm CMOS Using a Baud-Rate ADC with Digital Receiver Equalization and Clock Recovery , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[69]  S. Naffziger,et al.  Clock distribution on a dual-core, multi-threaded Itanium/sup /spl reg//-family processor , 2005, ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005..

[70]  Pedro López,et al.  Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors , 2007, 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07).

[71]  Stephen H. Gunther,et al.  Managing the Impact of Increasing Microprocessor Power Consumption , 2001 .