Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

Multicore and many-core clusters have become the most prominent form of High Performance Computing (HPC) system. Hardware complexity and hierarchy exist not only at the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style on-chip interconnects, and memory and shared-cache hierarchies. The Message Passing Interface (MPI), the programming model most widely adopted by the HPC community, suffers from decreased performance and portability as a result of this multilevel hardware complexity. We identified three critical issues specific to collective communication. First, there is a gap between the logical topologies of collective algorithms and the underlying hardware topologies. Second, current MPI implementations lack efficient shared-memory message-delivery mechanisms. Last, on distributed-memory machines such as multicore clusters, no single approach can encompass the extreme variations not only in bandwidth and latency, but also in features such as the ability to perform multiple concurrent copies simultaneously. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms and dynamically reshapes communication patterns to match the hardware's capabilities. Based on process-distance information, we used graph-partitioning techniques to organize the MPI processes into a multilevel hierarchy that maps onto the hardware characteristics. Meanwhile, we took advantage of the kernel-assisted, one-sided, single-copy approach (KNEM) as the underlying mechanism for intra-node message delivery within collective operations.
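
The multilevel-hierarchy idea described above can be illustrated with standard MPI calls. The sketch below is not the dissertation's framework; it only shows, assuming a simple two-level hierarchy (one subcommunicator per shared-memory node plus a communicator of per-node leaders), how a topology-aware broadcast can be staged level by level using MPI_Comm_split_type and MPI_Comm_split. The function name hierarchical_bcast and the choice of root are illustrative.

```c
/* Minimal sketch of a two-level, topology-aware broadcast.  Processes are
 * first grouped by shared-memory domain (one subcommunicator per node); the
 * lowest rank on each node becomes a leader and joins an inter-node
 * communicator.  The dissertation's framework goes further, using hardware
 * distance and graph partitioning to build deeper, NUMA-aware hierarchies. */
#include <mpi.h>
#include <stdio.h>

static void hierarchical_bcast(void *buf, int count, MPI_Datatype type,
                               MPI_Comm node_comm, MPI_Comm leader_comm)
{
    /* Stage 1: broadcast among the node leaders (inter-node traffic). */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Stage 2: each leader forwards the data inside its node (shared memory). */
    MPI_Bcast(buf, count, type, 0, node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group the processes that share memory, i.e., live on the same node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Local rank 0 on each node becomes that node's leader. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    int value = (world_rank == 0) ? 42 : -1;
    hierarchical_bcast(&value, 1, MPI_INT, node_comm, leader_comm);
    printf("rank %d got %d\n", world_rank, value);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```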
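The kernel-assisted single-copy idea can be sketched in a similar spirit. KNEM itself is a dedicated Linux module driven through ioctl() calls on /dev/knem; rather than reproduce that interface, the sketch below substitutes Linux cross-memory attach (process_vm_readv), a different mechanism that demonstrates the same principle: the kernel moves the payload directly between two address spaces in a single copy, instead of bouncing it through an intermediate shared-memory buffer as a classic copy-in/copy-out pipeline would. Requires Linux 3.2 or later and ptrace permission between the two processes.

```c
/* Stand-in illustration of the kernel-assisted single-copy principle that
 * KNEM provides, using cross-memory attach instead of the KNEM device. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/wait.h>

int main(void)
{
    static char buf[64] = "initial contents";

    pid_t child = fork();
    if (child == 0) {                        /* child: the "sender" */
        strcpy(buf, "payload moved by a single kernel-level copy");
        sleep(2);                            /* keep the buffer alive for the demo */
        _exit(0);
    }

    /* Parent: the "receiver" pulls the data straight out of the child. */
    sleep(1);                                /* crude synchronization for the demo */
    char dst[64] = {0};
    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = buf, .iov_len = sizeof(buf) };

    /* buf sits at the same virtual address in the child (fork() duplicated the
     * layout); the kernel copies child->parent directly, no bounce buffer. */
    if (process_vm_readv(child, &local, 1, &remote, 1, 0) < 0)
        perror("process_vm_readv");
    else
        printf("receiver saw: \"%s\"\n", dst);

    waitpid(child, NULL, 0);
    return 0;
}
```

In an MPI collective, the same pattern lets one leader expose a large buffer once while many on-node peers copy from it concurrently, which is what makes the kernel-assisted approach attractive for large-message broadcasts and reductions.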
