Topology mapping of irregular parallel applications on torus-connected supercomputers

Supercomputers with ever-increasing computing power are being built for scientific applications. As system size scales up, so does the size of the interconnect network. As a result, communication in supercomputers becomes increasingly expensive because of the long distances between nodes and network contention. Topology mapping, which maps parallel application processes onto compute nodes by considering the network topology and the application's communication pattern, is an essential technique for communication optimization. In this paper, we study the topology mapping problem for torus-connected supercomputers and present an analytical topology mapping algorithm for parallel applications with irregular communication patterns. We formulate the problem as a discrete optimization problem in the geometric domain of a torus topology and design an analytical mapping algorithm that uses numerical solvers to compute the mapping. Experimental results show that our algorithm provides high-quality mappings on 3-dimensional tori, reducing communication time by up to 72%.
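To make the mapping problem concrete, the following minimal sketch evaluates a candidate mapping on a 3-dimensional torus. It is an illustration only, not the algorithm presented in this paper. It assumes hop-bytes (message volume multiplied by hop count, summed over all communicating pairs) as the cost, a metric commonly used in the topology mapping literature; the abstract does not state which objective the paper optimizes. The names `torus_distance` and `hop_bytes`, the 4x4x4 torus, and the traffic values are all hypothetical.

```python
def torus_distance(a, b, dims):
    """Minimum hop count between node coordinates a and b on a torus.

    Each dimension wraps around, so the distance along a dimension of
    size d is the shorter of the direct path and the wraparound path.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(traffic, mapping, dims):
    """Assumed cost metric: sum of bytes * hops over communicating pairs.

    traffic: dict {(src_process, dst_process): bytes exchanged}
    mapping: dict {process: node coordinate tuple}
    """
    return sum(
        volume * torus_distance(mapping[p], mapping[q], dims)
        for (p, q), volume in traffic.items()
    )


# Hypothetical example: 4 processes with an irregular pattern on a 4x4x4 torus.
dims = (4, 4, 4)
traffic = {(0, 1): 100, (1, 2): 50, (0, 3): 10}

# A mapping that ignores the communication pattern.
scattered = {0: (0, 0, 0), 1: (2, 2, 2), 2: (0, 0, 1), 3: (2, 0, 0)}

# A mapping that places communicating processes on adjacent nodes;
# process 3 is one hop from process 0 through the wraparound link.
compact = {0: (0, 0, 0), 1: (0, 0, 1), 2: (0, 0, 2), 3: (0, 0, 3)}

print(hop_bytes(traffic, scattered, dims))  # 870
print(hop_bytes(traffic, compact, dims))    # 160
```

Under these assumptions, the compact mapping lowers the cost because every communicating pair ends up one hop apart. Searching for such placements automatically, for large irregular communication graphs, is the problem the mapping algorithm addresses.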
