A benchmark-driven modelling approach for evaluating deployment choices on a multi-core architecture

The complexity of current and emerging architectures provides users with options about how best to use the available resources, but makes predicting performance challenging. In this work a benchmark-driven model is developed for a simple shallow water code on a Cray XE6 system, to explore how deployment choices such as domain decomposition and core affinity affect performance.

The resource sharing present in modern multi-core architectures adds various levels of heterogeneity to the system. Shared resources often include cache, memory, network controllers and in some cases floating point units (as in the AMD Bulldozer), which means that access times depend on how application tasks are mapped to cores and on each core's location within the system. Heterogeneity increases further with the use of hardware accelerators such as GPUs and the Intel Xeon Phi, where many specialist cores are attached to general-purpose cores. This trend towards shared resources and non-uniform cores is expected to continue into the exascale era. The complexity of these systems means that various runtime scenarios are possible, and it has been found that under-populating nodes, altering the domain decomposition and using non-standard task-to-core mappings can dramatically alter performance. Finding the best configuration, however, is often a process of trial and error.

To better inform this process, a performance model was developed for a simple regular grid-based kernel code, shallow. The code comprises two distinct types of work: loop-based array updates and nearest-neighbour halo exchanges. Separate performance models were developed for each part, both based on a similar methodology. Application-specific benchmarks were run to measure performance for different problem sizes under different execution scenarios. These results were then fed into a performance model that derives resource usage for a given deployment scenario, interpolating between benchmark results as necessary.
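The interpolation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear interpolation, the clamping outside the measured range, the four-neighbour halo pattern and all numbers below are assumptions made purely for illustration.

```python
from bisect import bisect_left


def interpolate(samples, x):
    """Linearly interpolate a time at size x from (size, seconds) samples.

    `samples` must be sorted by size. Values outside the measured range
    are clamped to the nearest measurement (an assumption of this sketch).
    """
    sizes = [size for size, _ in samples]
    if x <= sizes[0]:
        return samples[0][1]
    if x >= sizes[-1]:
        return samples[-1][1]
    i = bisect_left(sizes, x)
    (x0, y0), (x1, y1) = samples[i - 1], samples[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def predict_step_time(nx, ny, px, py, compute_samples, halo_samples):
    """Estimate the time per step for an nx-by-ny grid on px-by-py tasks.

    compute_samples: (cells per task, seconds) from an array-update benchmark.
    halo_samples: (cells per message, seconds) from a halo-exchange benchmark.
    """
    local_nx = -(-nx // px)  # ceiling division: largest local extent in x
    local_ny = -(-ny // py)  # ceiling division: largest local extent in y
    compute = interpolate(compute_samples, local_nx * local_ny)
    # Assumed pattern: two exchanges along each edge direction per step.
    halo = 2 * interpolate(halo_samples, local_nx) \
        + 2 * interpolate(halo_samples, local_ny)
    return compute + halo


if __name__ == "__main__":
    # Hypothetical benchmark tables, not measured data.
    compute_samples = [(100, 1.0), (400, 4.0), (1600, 18.0)]
    halo_samples = [(10, 0.1), (20, 0.2), (40, 0.5)]
    # Compare candidate decompositions of the same 40 x 20 grid on 4 tasks.
    for px, py in [(1, 4), (2, 2), (4, 1)]:
        t = predict_step_time(40, 20, px, py, compute_samples, halo_samples)
        print(f"{px} x {py}: {t:.3f} s per step")
```

In this sketch a separate table per execution scenario (for example, a fully populated versus an under-populated node) would be passed in to compare deployments; how the scenarios are actually represented in the model is not stated here.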
