Power-Performance Comparison of Single-Task Driven Many-Cores

Many-core processors, with hundreds of cores, are becoming increasingly popular in general-purpose computing, yet power remains a limiting factor in their performance. In this paper, we compare the power and performance of two design points in the many-core processor domain. The XMT general-purpose processor provides a significant runtime advantage on irregular parallel programs (e.g., graph algorithms), an advantage that was previously demonstrated and attributed to its architectural choices and ease of programming. In contrast, current commercial GPUs excel at regular parallel programs that require high processing capability. In this work, we set the power envelope as a constraint and evaluate an envisioned 1024-core XMT processor against an NVIDIA GTX280 GPU, considering various scenarios for estimating the power of the XMT chip. Even under worst-case assumptions, simulations show that the XMT processor sustains its advantage over the GPU on irregular parallel programs, while not falling significantly behind on regular programs. The total energy spent per benchmark follows a similar pattern. Given that the two architectures target different types of parallelism, a future system could potentially use an XMT chip and a GPU chip in complementary roles.
