Paper Information - A Balanced Approach to Application Performance Tuning


Current hardware trends place increasing pressure on programmers and tools to optimize scientific code. Numerous tools and techniques exist, but no single tool is a panacea; instead, different tools have different strengths. Therefore, an assortment of performance tuning utilities and strategies is necessary to make the best use of scarce resources (e.g., bandwidth, functional units, cache). This paper describes a combined methodology for the optimization process. The strategy combines static assembly analysis using MAQAO with dynamic information from hardware performance monitoring (HPM) and memory traces. We introduce a new technique, decremental analysis (DECAN), to iteratively identify the individual instructions responsible for performance bottlenecks. We present case studies on applications from several independent software vendors (ISVs) on an SMP Xeon Core 2 platform. These strategies help discover problems related to memory access locality and loop unrolling; addressing them leads to a sequential performance improvement of a factor of 2.
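The abstract does not spell out how decremental analysis works, so the following is only a minimal sketch of the general idea under stated assumptions: remove one instruction at a time from a loop body, re-measure, and rank instructions by how much time their removal saves. The `Instr` type, the `measure` cost model (a loop runs at the speed of its busiest resource), and all cost figures are hypothetical illustrations, not the paper's implementation, which works on real binaries and real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str    # hypothetical resource class: "mem" or "fp"
    cost: float  # hypothetical cycles per iteration on that resource


def measure(body):
    """Toy cost model (an assumption): time equals the busiest resource's load."""
    totals = {}
    for ins in body:
        totals[ins.unit] = totals.get(ins.unit, 0.0) + ins.cost
    return max(totals.values(), default=0.0)


def decremental_analysis(body, measure_fn=measure):
    """Remove each instruction in turn and record the time saved."""
    baseline = measure_fn(body)
    results = []
    for i, ins in enumerate(body):
        variant = body[:i] + body[i + 1:]  # loop body without instruction i
        saved = baseline - measure_fn(variant)
        results.append((ins.name, saved))
    # Largest saving first: these instructions are the likeliest bottlenecks.
    results.sort(key=lambda r: r[1], reverse=True)
    return baseline, results


# Hypothetical loop body with one poorly localized (strided) load.
loop_body = [
    Instr("load a[i]", "mem", 4.0),
    Instr("load b[j*n+i]", "mem", 12.0),
    Instr("mul", "fp", 3.0),
    Instr("add", "fp", 3.0),
    Instr("store c[i]", "mem", 4.0),
]

baseline, ranking = decremental_analysis(loop_body)
for name, saved in ranking:
    print(f"{name:15s} saves {saved:5.1f} of {baseline:.1f} cycles")
```

In this toy model, removing either floating-point instruction saves nothing because the loop is memory bound, while removing the strided load saves the most. That is the kind of signal a decremental approach is meant to expose. In practice `measure_fn` would be replaced by an actual timed run of the modified loop.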
