Framework for a productive performance optimization

Modern supercomputers deliver large computational power, but it is difficult for applications to exploit that power. One factor that limits application performance is single-node performance. While many performance tools use microprocessor performance counters to provide insight into serial node performance issues, the complex semantics of these counters pose an obstacle to inexperienced developers. We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. Its output is precise, and it correlates performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With this information, the developer can focus on the identified code and improve it, knowing which processor execution unit is degrading performance. To demonstrate the usefulness of the framework, we apply it to three already optimized applications using realistic inputs and modify their source code according to the results. With modifications that require little effort, we increase the applications' performance by 10% to 30%, which shortens the time to solution and/or allows larger problem sizes to be tackled.
