TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer

As high performance computing systems continue to grow in scale, communication overhead between processors has become a key contributor to performance bottlenecks. Default process-to-processor mapping strategies, however, do not take the topology of the interconnection network into account, so communication messages may traverse unnecessarily long distances. To enhance communication locality, we propose a new topology-aware mapping method called TAMM. By generating an accurate description of the communication pattern and the network topology, TAMM employs a two-step optimization strategy to obtain an efficient mapping for various parallel applications: it first extracts an appropriate subset of the idle computing resources on the underlying system, and then constructs an optimized one-to-one mapping with a refined iterative algorithm. Experimental results demonstrate that TAMM effectively improves communication performance on the Tianhe-2A supercomputer.
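To make the idea of topology-aware one-to-one mapping concrete, the sketch below shows a simple greedy heuristic that places heavily communicating processes on nearby nodes by minimizing hop-bytes (traffic weighted by hop distance). This is an illustrative stand-in, not the paper's refined iterative algorithm; the function name `greedy_map` and the toy matrices are assumptions for the example.

```python
# Hypothetical sketch of greedy topology-aware process-to-node mapping.
# Objective: minimize hop-bytes = sum over pairs of comm[i][j] * dist(node_i, node_j).
# This is NOT the TAMM algorithm itself, only a common baseline heuristic.

def greedy_map(comm, dist):
    """Return a list perm where process i is placed on node perm[i].

    comm[i][j] -- traffic volume between processes i and j (symmetric).
    dist[a][b] -- hop distance between nodes a and b.
    Assumes one process per node and len(comm) == len(dist).
    """
    n = len(comm)
    # Seed: place the process with the largest total traffic on node 0.
    totals = [sum(row) for row in comm]
    first = max(range(n), key=lambda i: totals[i])
    perm = {first: 0}
    free_nodes = set(range(1, n))
    unmapped = set(range(n)) - {first}
    while unmapped:
        # Pick the unmapped process with the most traffic to already-placed ones.
        p = max(unmapped, key=lambda i: sum(comm[i][j] for j in perm))
        # Place it on the free node that adds the least hop-bytes.
        node = min(free_nodes,
                   key=lambda a: sum(comm[p][j] * dist[a][perm[j]] for j in perm))
        perm[p] = node
        free_nodes.remove(node)
        unmapped.remove(p)
    return [perm[i] for i in range(n)]
```

On a toy 4-node linear chain (hop distance |a - b|), the heuristic keeps the heavily communicating pair of processes on adjacent nodes, which is the locality effect a topology-aware mapping aims for.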
