Automatic Recognition of Performance Idioms in Scientific Applications

Basic data-flow patterns that we call \textbf{performance idioms}, such as stream, transpose, reduction, random access, and stencil, are common in scientific numerical applications. We hypothesize that a small number of idioms can cover most programming constructs that dominate the execution time of scientific codes and can be used to approximate application performance. To test these hypotheses, we propose an automatic idiom-recognition method and implement it on top of the open-source Open64 compiler. Using the NAS Parallel Benchmarks (NPB) as a case study, the prototype system agrees with idiom classification by a human expert in about $90\%$ of cases. Our results show that the five idioms above suffice to cover $100\%$ of six NPB codes (MG, CG, FT, BT, SP, and LU). We also compared the performance of our idiom benchmarks with their corresponding instances in the NPB codes on two different platforms, using different methods; the approximation accuracy reaches $96.6\%$. Our contribution is to show that a small set of idioms can cover more complex codes, that idioms can be recognized automatically, and that suitably defined idioms can approximate application performance.
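As a rough illustration of the five idioms named above, the following sketch shows a minimal form of each pattern. These are hypothetical toy examples for exposition only; they are not taken from the paper's benchmarks, and the function names and array shapes are our own assumptions.

```python
# Hypothetical minimal examples of the five performance idioms
# (stream, transpose, reduction, random access, stencil).
# Names and shapes are illustrative, not from the paper.

def stream(a, b, scalar):
    # Stream: unit-stride, element-wise traversal of arrays
    # (the triad pattern of the STREAM benchmark).
    return [scalar * x + y for x, y in zip(a, b)]

def transpose(m):
    # Transpose: exchange row and column access order of a 2-D array.
    return [list(row) for row in zip(*m)]

def reduction(a):
    # Reduction: collapse an array to a scalar with an associative operator.
    total = 0
    for x in a:
        total += x
    return total

def random_access(table, idx):
    # Random access: data-dependent, irregular indexing (a gather).
    return [table[i] for i in idx]

def stencil(a):
    # Stencil: each output point combines a fixed neighborhood of inputs
    # (3-point 1-D average shown here).
    return [(a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, len(a) - 1)]
```

The distinguishing feature is the memory-access pattern: stream and stencil are regular, random access is irregular, and reduction carries a loop-carried dependence that a compiler must recognize to parallelize.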
