Performance Analysis of Irregular Collective Communication with the Crystal Router Algorithm

In order to achieve exascale performance, it is important to detect potential bottlenecks and identify strategies to overcome them. For this, both applications and system software must be analysed and, where necessary, improved. The EU FP7 project Collaborative Research into Exascale Systemware, Tools & Applications (CRESTA) adopted a co-design approach, developing advanced simulation applications together with system software and development tools. In this paper, we present the results of a co-design activity focused on the simulation code NEK5000 that aims at improving the performance of collective communication operations. We have analysed the algorithms that form the core of NEK5000's communication module in order to assess their viability on recent computer architectures before starting to improve their performance. Our results show that the crystal router algorithm performs well for sparse, irregular collective operations at medium and large processor counts, but that further improvements will be needed for the even larger system sizes of the future. We sketch the required improvements, which will also make the communication algorithms beneficial for other applications that rely on latency-dominated communication schemes with short messages. The latency-optimised communication operations will also be used in a runtime system for dynamic load balancing that is under development within CRESTA.
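For readers unfamiliar with the crystal router, the sketch below illustrates the general idea behind this class of staged, hypercube-based exchanges: messages bound for "the other half" of the machine are bundled and forwarded in log2(P) hops, so that many short messages share each network transaction. This is a minimal, hypothetical C/MPI illustration, assuming a power-of-two number of ranks and fixed-size messages; the names (crystal_route, msg_t) and the fixed message layout are our own simplifications and do not reproduce the NEK5000/gslib implementation analysed in the paper.

```c
/*
 * Minimal sketch of a crystal-router-style sparse exchange.
 * Assumptions: the number of ranks is a power of two and each
 * message is a fixed-size {dest, value} record. Illustrative only.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int dest; double value; } msg_t;

/* Route all messages to their destination ranks in log2(P) stages.
 * On return, *msgs holds exactly the messages addressed to this rank. */
static void crystal_route(msg_t **msgs, int *nmsg, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int bit = 1; bit < size; bit <<= 1) {
        int partner = rank ^ bit;

        /* Split: keep messages whose destination shares this address
         * bit with us; bundle the rest for the partner in one send. */
        msg_t *keep = malloc((*nmsg + 1) * sizeof(msg_t));
        msg_t *send = malloc((*nmsg + 1) * sizeof(msg_t));
        int nkeep = 0, nsend = 0;
        for (int i = 0; i < *nmsg; i++) {
            if (((*msgs)[i].dest & bit) == (rank & bit))
                keep[nkeep++] = (*msgs)[i];
            else
                send[nsend++] = (*msgs)[i];
        }

        /* Exchange counts, then the bundled payloads, with one partner. */
        int nrecv = 0;
        MPI_Sendrecv(&nsend, 1, MPI_INT, partner, 0,
                     &nrecv, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);

        msg_t *merged = malloc((nkeep + nrecv + 1) * sizeof(msg_t));
        if (nkeep > 0)
            memcpy(merged, keep, nkeep * sizeof(msg_t));
        MPI_Sendrecv(send, nsend * (int)sizeof(msg_t), MPI_BYTE, partner, 1,
                     merged + nkeep, nrecv * (int)sizeof(msg_t), MPI_BYTE,
                     partner, 1, comm, MPI_STATUS_IGNORE);

        free(*msgs); free(keep); free(send);
        *msgs = merged;
        *nmsg = nkeep + nrecv;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Example: every rank sends one short message to the next rank. */
    int nmsg = 1;
    msg_t *msgs = malloc(sizeof(msg_t));
    msgs[0].dest = (rank + 1) % size;
    msgs[0].value = (double)rank;

    crystal_route(&msgs, &nmsg, MPI_COMM_WORLD);

    for (int i = 0; i < nmsg; i++)
        printf("rank %d received %.0f\n", rank, msgs[i].value);

    free(msgs);
    MPI_Finalize();
    return 0;
}
```

The point of the staged scheme is that each rank talks to only log2(P) partners and every hop carries a bundle of short messages, which is why this pattern is attractive for the latency-dominated, irregular communication discussed in the abstract.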
