Paper information - Performance Analysis and Optimization of Parallel Scientific Applications on CMP Clusters

Chip multiprocessors (CMPs) are widely used for high-performance computing. Further, these CMPs are configured hierarchically to compose a node in a cluster system. A major challenge is the efficient use of such cluster systems for large-scale scientific applications. In this paper, we quantify the performance gap resulting from using different numbers of processors per node; this information provides a baseline for the amount of optimization needed when using all processors per node on CMP clusters. We conduct detailed performance analysis to identify how applications can be modified to efficiently utilize all processors per node, using three scientific applications: the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell magnetic fusion application; a Lattice Boltzmann Method application for simulating fluid dynamics (LBM); and GYRO, an advanced Eulerian gyrokinetic-Maxwell equation solver for simulating microturbulent transport in plasma. In terms of refinements, we use conventional techniques such as loop blocking, loop unrolling, and loop fusion, and develop hybrid methods for optimizing MPI_Allreduce and MPI_Reduce. Using these optimizations, application performance when utilizing all processors per node improved by up to 18.97% for GTC, 15.77% for LBM, and 12.29% for GYRO on up to 2048 total processors on the CMP clusters.
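As an illustration of two of the conventional techniques named in the abstract, the following minimal sketch shows loop blocking and loop fusion on toy kernels. It is not code from GTC, LBM, or GYRO; the function names, the flat row-major layout, and the block size are assumptions chosen for the example, and Python is used only for brevity, so the sketch shows the shape of the transformation rather than any speedup.

```python
def transpose_naive(a, n):
    """Transpose an n x n matrix stored as a flat row-major list."""
    out = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]
    return out


def transpose_blocked(a, n, block=4):
    """Same result, visiting the matrix in block x block tiles (loop blocking).

    In a compiled language the tile is sized so that the rows read and the
    columns written stay in cache while the tile is processed.
    """
    out = [0] * (n * n)
    for ii in range(0, n, block):          # tile row origin
        for jj in range(0, n, block):      # tile column origin
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out


def scale_then_sum_unfused(x, c):
    """Two separate passes over the data: scale, then reduce."""
    y = [c * v for v in x]
    total = 0
    for v in y:
        total += v
    return y, total


def scale_then_sum_fused(x, c):
    """One pass that scales and reduces together (loop fusion)."""
    y = [0] * len(x)
    total = 0
    for k, v in enumerate(x):
        y[k] = c * v
        total += y[k]   # reuse the value just computed instead of re-reading y later
    return y, total
```

Both variants of each kernel compute identical results; only the iteration order (blocking) or the number of passes over the data (fusion) changes, which is what makes such transformations safe to apply when several cores on a node share cache and memory bandwidth.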

Xingfu Wu | Valerie E. Taylor | Charles W. Lively | Sameh Sharkawi
