A Balanced Approach to Application Performance Tuning

Current hardware trends place increasing pressure on programmers and tools to optimize scientific code. Numerous tools and techniques exist, but no single tool is a panacea; each has different strengths. An assortment of performance tuning utilities and strategies is therefore necessary to make the best use of scarce resources (e.g., bandwidth, functional units, cache). This paper describes a combined methodology for the optimization process. The strategy combines static assembly analysis using MAQAO with dynamic information from hardware performance monitoring (HPM) and memory traces. We introduce a new technique, decremental analysis (DECAN), to iteratively identify the individual instructions responsible for performance bottlenecks. We present case studies on applications from several independent software vendors (ISVs) on an SMP Xeon Core 2 platform. These strategies help discover problems related to memory access locality and loop unrolling; addressing them leads to a sequential performance improvement of a factor of 2.