Performance/Watt: the new server focus

Transaction processing has emerged as the killer application for commercial servers. Most servers are engaged in transactional workloads such as processing search requests, serving middleware, evaluating decisions, managing databases, and powering online commerce. Currently, commercial servers are built from one or more high-performance superscalar processors. However, commercial server applications exhibit high cache miss rates, large memory footprints, and low instruction-level parallelism (ILP), which leads to poor utilization on traditional ILP-focused superscalar processors [11]. In addition, these ILP-focused processors have been optimized primarily to deliver maximum performance by employing high clock rates and large amounts of speculation. As a result, the performance/Watt of successive generations of traditional ILP-focused processors on server workloads has been flat [4] or even decreasing. This stagnation in processor performance/Watt, coupled with the continued decrease in server hardware acquisition costs and likely increases in future power and cooling costs, is leading to a situation where the total cost of server ownership will soon be determined predominantly by power [4]. In this paper, we argue that exploiting thread-level parallelism (TLP) via a large number of simple cores on a chip multiprocessor (CMP) leads to much better performance/Watt for server workloads. As a case study, we compare Sun's TLP-oriented Niagara processor against the ILP-oriented dual-core Pentium Extreme Edition from Intel, showing that the Niagara processor has a significant performance/Watt advantage for throughput-oriented server applications.

[1]  Donald Yeung,et al.  Sparcle: an evolutionary processor design for large-scale multiprocessors , 1993, IEEE Micro.

[2]  Anoop Gupta,et al.  Interleaving: a multithreading technique targeting multiprocessors and workstations , 1994, ASPLOS VI.

[3]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[4]  Trevor N. Mudge,et al.  Author retrospective improving data cache performance by pre-executing instructions under a cache miss , 1997, International Conference on Supercomputing.

[5]  Luiz André Barroso,et al.  Memory system characterization of commercial workloads , 1998, ISCA.

[6]  Susan J. Eggers,et al.  An analysis of database workload performance on simultaneous multithreaded processors , 1998, ISCA.

[7]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture.

[8]  Mikko H. Lipasti,et al.  A performance methodology for commercial servers , 2000, IBM J. Res. Dev.

[9]  Irith Pomeranz,et al.  Transient-fault recovery using simultaneous multithreading , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[10]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[11]  Norman P. Jouppi,et al.  The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays , 2002, ISCA.

[12]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture (HPCA-9).

[13]  David A. Koufaty,et al.  Hyperthreading Technology in the Netburst Microarchitecture , 2003, IEEE Micro.

[14]  S. Chu,et al.  Design and Implementation of the POWER5™ Microprocessor , 2004 .

[15]  XML Processing Performance in Java and .NET , 2004 .

[16]  Balaram Sinharoy,et al.  Design and implementation of the POWER5 microprocessor , 2004, Proceedings of the 41st Design Automation Conference.

[17]  Mike Lee,et al.  Design and Implementation of the POWER5™ Microprocessor , 2004 .

[18]  Marc Tremblay,et al.  High-performance throughput computing , 2005, IEEE Micro.

[19]  Luiz André Barroso,et al.  The Price of Performance , 2005, ACM Queue.

[20]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[21]  Kunle Olukotun,et al.  Maximizing CMP throughput with mediocre cores , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[22]  S. Naffziger,et al.  Power and temperature control on a 90nm Itanium®-family processor , 2005, IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers.

[23]  S. Naffziger,et al.  Power and temperature control on a 90-nm Itanium family processor , 2006, IEEE Journal of Solid-State Circuits.

[24]  E. Alon,et al.  The implementation of a 2-core, multi-threaded Itanium family processor , 2006, IEEE Journal of Solid-State Circuits.