Paper information - Modeling and predicting performance of high performance computing applications on hardware accelerators

Hybrid-core systems speed up applications by offloading certain compute operations that run faster on hardware accelerators. However, such systems require significant programming and porting effort to gain a performance benefit from the accelerators. Therefore, before porting, it is prudent to investigate the predicted performance benefit of accelerators for a given workload. To address this problem, we present a performance-modeling framework that predicts application performance rapidly and accurately for hybrid-core systems. We present predictions for two full-scale HPC applications, HYCOM and Milc. Our results for two accelerators (GPU and FPGA) show that gather/scatter and stream operations can speed up by as much as a factor of 15, and that the overall compute time of Milc and HYCOM improves by 3.4% and 20%, respectively. We also show that, in order to benefit from the accelerators, 70% of the latency of data transfer time between the CPU and the accelerators needs to be overcome.
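The trade-off the abstract describes, a large speedup on offloaded operations set against the cost of moving data between the CPU and the accelerator, can be illustrated with a simple Amdahl-style estimate. The sketch below is not the paper's modeling framework; the function names and all input values are hypothetical and chosen only to show the arithmetic.

```python
def accelerated_time(total_time, offload_fraction, kernel_speedup, transfer_time):
    """Estimate run time after offloading part of the work to an accelerator.

    Hypothetical Amdahl-style model (an assumption, not the paper's method):
    the offloaded fraction runs `kernel_speedup` times faster, the rest stays
    on the CPU, and `transfer_time` is added for CPU<->accelerator data movement.
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    host_time = total_time * (1.0 - offload_fraction)
    device_time = total_time * offload_fraction / kernel_speedup
    return host_time + device_time + transfer_time


def break_even_transfer_time(total_time, offload_fraction, kernel_speedup):
    """Largest transfer time for which offloading is no slower than the CPU alone."""
    return total_time * offload_fraction * (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Illustrative inputs only: 100 s run, 25% of it offloadable, 5x kernel speedup.
    base = 100.0
    limit = break_even_transfer_time(base, 0.25, 5.0)
    for transfer in (0.0, limit / 2.0, limit):
        t = accelerated_time(base, 0.25, 5.0, transfer)
        print(f"transfer={transfer:5.1f} s -> total={t:6.1f} s")
```

Under this simplified model, the benefit of offloading disappears once the transfer time reaches the break-even value, which is why hiding or reducing transfer latency matters as much as the raw kernel speedup.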
