Modeling and predicting performance of high performance computing applications on hardware accelerators

Hybrid-core systems speed up applications by offloading compute operations that run faster on hardware accelerators. However, such systems require significant programming and porting effort before the accelerators deliver a performance benefit. It is therefore prudent, prior to porting, to investigate the predicted performance benefit of accelerators for a given workload. To address this problem, we present a performance-modeling framework that rapidly and accurately predicts application performance on hybrid-core systems. We present predictions for two full-scale HPC applications: HYCOM and Milc. Our results for two accelerators (GPU and FPGA) show that gather/scatter and stream operations can be sped up by as much as a factor of 15, and that the overall compute time of Milc and HYCOM improves by 3.4% and 20%, respectively. We also show that, in order to benefit from the accelerators, 70% of the data transfer latency between the CPU and the accelerators needs to be overcome.
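
The gap between a large per-operation speedup and a modest overall improvement can be illustrated with a simple Amdahl-style calculation. The sketch below is not the framework described in this abstract; it is a minimal illustration in which the function names, the single `offload_fraction` parameter, and the additive `t_transfer` term are all assumptions made for the example.

```python
# Illustrative Amdahl-style estimate; NOT the paper's modeling framework.
# All names and the additive transfer-cost model are assumptions.

def predicted_compute_time(t_cpu, offload_fraction, kernel_speedup, t_transfer=0.0):
    """Estimate compute time after offloading part of the work.

    t_cpu            -- baseline CPU-only compute time (any time unit)
    offload_fraction -- share of t_cpu spent in offloadable operations (0..1)
    kernel_speedup   -- speedup of those operations on the accelerator (> 0)
    t_transfer       -- CPU<->accelerator data transfer time that is not hidden
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    t_host = t_cpu * (1.0 - offload_fraction)            # work left on the CPU
    t_accel = t_cpu * offload_fraction / kernel_speedup  # offloaded work
    return t_host + t_accel + t_transfer


def improvement_percent(t_cpu, t_new):
    """Relative reduction in compute time, in percent (negative = slowdown)."""
    return 100.0 * (t_cpu - t_new) / t_cpu


def break_even_transfer(t_cpu, offload_fraction, kernel_speedup):
    """Largest unhidden transfer time for which offloading still breaks even."""
    return t_cpu * offload_fraction * (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical workload: 20% of the time is offloadable at a 15x speedup.
    base = 100.0
    for transfer in (0.0, 10.0, 20.0):
        t_new = predicted_compute_time(base, 0.2, 15.0, transfer)
        print(transfer, round(improvement_percent(base, t_new), 2))
```

Under these assumptions, the overall gain is bounded by the offloadable fraction rather than by the kernel speedup, and any transfer time above `break_even_transfer(...)` turns the offload into a net slowdown, which is why reducing or hiding transfer latency matters.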
