Wimpy or brawny cores: A throughput perspective

In this paper, we conduct a coarse-grained comparative analysis of wimpy (i.e., simple) fine-grained multicore processors against brawny (i.e., complex) simultaneous multithreaded (SMT) multicore processors for server applications with strong request-level parallelism. We explore a large design space along multiple dimensions, including the number of cores, the number of threads, and a wide range of workloads. For strongly CPU-bound workloads, a 2R-core wimpy-multicore processor is found to be on par with an R-core brawny-multicore processor in terms of throughput. For strongly memory-bound workloads, core-level multithreading is largely ineffective for both wimpy-multicore and brawny-multicore processors, except when the core and thread counts per memory/disk interface are low. For both wimpy-multicore and brawny-multicore processors, there is an optimal number of cores at which throughput peaks, and this optimum decreases as the workload becomes more memory-bound. Moreover, there is a threshold number of cores for a wimpy-multicore processor, beyond which it is outperformed by its brawny-multicore counterpart. These behaviors indicate that brawny-multicore processors are better choices than wimpy-multicore processors in terms of throughput.
