Performance Benefits of Heterogeneous Computing in HPC Workloads

Chip multi-processors (CMPs) with an increasing number of processor cores are becoming widely available. To take advantage of many-core CMPs, applications must be parallelized. However, owing to the nature of the algorithm or programming model, some parts of an application remain serial. According to Amdahl's law, the speedup of a parallel application is limited by the fraction of its execution that is serial. For a CMP with many cores, this can be a serious limitation. To take full advantage of the increasing number of cores, one must reduce the execution time of the serial portion of a parallel program. Rewriting an application takes time, however, and the return on the effort invested often does not justify parallelizing every part of the program. A heterogeneous many-core CMP design is one possible solution: it supports massively parallel execution while still providing reasonable single-thread performance. In this paper, we use a simple spreadsheet model to evaluate homogeneous and heterogeneous CMP designs using execution profiles of real HPC applications. Evaluated on 12 parallel HPC applications, heterogeneous CMPs outperform homogeneous CMPs by up to 1.35×, with an average speedup of 1.06×, when both designs are constrained to the same power budget. We find that heterogeneous CMPs can deliver a performance benefit even when the serial portion of execution is as little as 2% of total run time. This suggests that heterogeneous computing can help mitigate the cost of leaving some portions of an application unparallelized because of return-on-investment concerns about programming effort.
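The limit imposed by Amdahl's law, and the way a faster core for the serial portion relaxes it, can be sketched in a few lines of Python. This is not the spreadsheet model used in the paper, which is not described here. It is a minimal illustration under stated assumptions: the parallel portion scales perfectly and runs only on the small cores, and the names `big_core_perf` and `big_core_cost` (the big core's speed and power cost relative to one small core) are hypothetical parameters introduced for this example.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's law: the serial part runs on one core, the rest on n_cores."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


def heterogeneous_speedup(serial_fraction, budget, big_core_perf, big_core_cost):
    """Speedup of one big core plus small cores under a fixed power budget.

    Assumptions (illustrative, not taken from the paper):
    - budget is expressed in units of one small core's power;
    - the big core costs big_core_cost units and runs serial code
      big_core_perf times faster than a small core;
    - the parallel part runs only on the remaining small cores.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    n_small = budget - big_core_cost
    if n_small < 1:
        raise ValueError("budget leaves no room for small cores")
    serial_time = serial_fraction / big_core_perf
    parallel_time = (1.0 - serial_fraction) / n_small
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # Hypothetical configuration: a budget of 64 small cores, and a big core
    # that costs 4 small cores of power and is 2x faster on serial code.
    for f in (0.0, 0.02, 0.10):
        homo = amdahl_speedup(f, 64)
        hetero = heterogeneous_speedup(f, 64, big_core_perf=2.0, big_core_cost=4)
        print(f"serial={f:.2f} homogeneous={homo:.2f} heterogeneous={hetero:.2f}")
```

In this toy setting the heterogeneous design loses when there is no serial work at all, because it gives up small cores to pay for the big one, and it wins once the serial fraction is large enough for the faster serial execution to outweigh that loss. The actual crossover point depends entirely on the assumed parameters.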
