High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters

Intel's Many Integrated Core (MIC) architecture aims to provide Teraflop throughput (through high degrees of parallelism) with a high FLOP/Watt ratio and x86 compatibility. However, this two-fold approach to solving the power and programmability challenges of Exascale computing is constrained by certain architectural idiosyncrasies. MIC coprocessors have a memory-constrained environment, and their cores operate at slower clock rates. Also, because MIC coprocessors are PCIe devices, their communication characteristics differ from the communication behavior seen in homogeneous environments. For instance, sending data from MIC memory to a remote node's memory through message passing routines incurs 3x-6x higher latency than sending from host processor memory. Hence, communication libraries that do not consider these architectural subtleties are likely to nullify the performance benefits, or even cause degradation, in applications that intend to use MICs and rely heavily on communication routines. The performance of Message Passing Interface (MPI) operations, especially dense collective operations like Alltoall and Allgather, strongly affects the performance of many distributed parallel applications. In this paper, we revisit state-of-the-art algorithms commonly used to implement these collectives and propose adaptations and optimizations to alleviate architectural bottlenecks on MIC clusters. We also propose a few novel designs to improve the communication latency of these operations. Through micro-benchmarks and applications, we substantiate the benefits of incorporating the proposed adaptations to the Alltoall and Allgather collective operations. At the micro-benchmark level, the proposed designs show as much as 79% improvement for Allgather and up to 70% improvement for Alltoall; with the P3DFFT application, we observe a 38% improvement in overall execution time.
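
To give a concrete sense of the kind of micro-benchmark referred to above, the sketch below times a dense MPI_Alltoall exchange in a simple loop. It is only an illustrative skeleton, not the paper's actual benchmark: the message size, iteration counts, and output format are assumptions, and the comparison between host-resident and MIC-resident ranks is obtained simply by launching the same binary in the two placements under a MIC-aware MPI stack.

```c
/*
 * Minimal Alltoall latency micro-benchmark sketch (illustrative only).
 * Run the same binary with ranks placed on host CPUs and then on MIC
 * coprocessors to compare the two cases; MSG_SIZE and the iteration
 * counts are arbitrary choices for this sketch.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE   4096   /* bytes exchanged with every other rank */
#define ITERATIONS 100
#define WARMUP     10

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One MSG_SIZE chunk per peer in both the send and receive buffers. */
    char *sendbuf = malloc((size_t)size * MSG_SIZE);
    char *recvbuf = malloc((size_t)size * MSG_SIZE);

    double start = 0.0;
    for (int i = 0; i < WARMUP + ITERATIONS; i++) {
        if (i == WARMUP) {               /* exclude warm-up iterations */
            MPI_Barrier(MPI_COMM_WORLD);
            start = MPI_Wtime();
        }
        MPI_Alltoall(sendbuf, MSG_SIZE, MPI_CHAR,
                     recvbuf, MSG_SIZE, MPI_CHAR, MPI_COMM_WORLD);
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("Avg Alltoall latency (%d ranks, %d B per peer): %.2f us\n",
               size, MSG_SIZE, elapsed * 1e6 / ITERATIONS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Replacing MPI_Alltoall with MPI_Allgather in the same loop yields the corresponding Allgather measurement; the 3x-6x latency gap quoted above is what such a loop exposes when the buffers reside in MIC memory rather than host memory.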
