Automatic topology mapping of diverse large-scale parallel applications

Topology-aware mapping assigns tasks to processors so as to minimize network load, thereby reducing the time spent waiting for communication to complete. Many mapping schemes and algorithms have been proposed. Some are application- or domain-specific, and others require significant effort from developers or users to apply successfully. Moreover, a task mapping algorithm alone is not enough to map the diverse set of applications that exist. Applications can have distinct communication patterns: point-to-point communication with neighbors in a virtual process grid, irregular point-to-point communication, different types of collectives with differing group sizes, and any combination of these. These patterns should be analyzed, and the critical patterns extracted and automatically provided to the mapping algorithm, all without specialized user input. To our knowledge, this problem has not been addressed before for the general case. In this paper, we propose a complete and automatic mapping system that requires no special user involvement, works with any application, and produces mappings that perform better than existing schemes for a wide range of communication patterns and machine topologies. This makes it suitable for online mapping of HPC applications in many different scenarios. We evaluate our scheme with several applications exhibiting different communication patterns (including collectives) on machines with 3D torus, 5D torus, and fat-tree network topologies, and show performance improvements of up to 2.2x.
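To make "network load" concrete, the sketch below compares two placements of a small ring of tasks on a 3D torus using hop-bytes (message size multiplied by the number of network hops), a common proxy for network load. This is an illustrative assumption, not the system proposed in the paper: the metric choice, the function names (`torus_hops`, `hop_bytes`, `index_to_coord`), the 4x4x4 torus, and the ring pattern are all hypothetical.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between two torus coordinates, using wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(comm, mapping, dims):
    """Sum of (bytes sent * hops travelled) over all communicating task pairs."""
    return sum(
        nbytes * torus_hops(mapping[src], mapping[dst], dims)
        for (src, dst), nbytes in comm.items()
    )


def index_to_coord(idx, dims):
    """Convert a row-major node index into an (x, y, z) torus coordinate."""
    x, rem = divmod(idx, dims[1] * dims[2])
    y, z = divmod(rem, dims[2])
    return (x, y, z)


# Hypothetical machine: a 4x4x4 torus (64 nodes).
dims = (4, 4, 4)

# Hypothetical application: 8 tasks in a ring, each sending 1000 bytes to the next.
num_tasks = 8
comm = {(i, (i + 1) % num_tasks): 1000 for i in range(num_tasks)}

# Topology-oblivious placement: tasks scattered across the machine with a fixed stride.
scattered = {i: index_to_coord((i * 9) % 64, dims) for i in range(num_tasks)}

# Topology-aware placement: the ring follows a closed loop of adjacent nodes.
loop = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0),
        (3, 1, 0), (2, 1, 0), (1, 1, 0), (0, 1, 0)]
aware = {i: loop[i] for i in range(num_tasks)}

print("scattered hop-bytes:", hop_bytes(comm, scattered, dims))
print("aware hop-bytes:    ", hop_bytes(comm, aware, dims))
```

In the aware placement every ring neighbor is one hop away, so each message crosses a single link. The scattered placement sends the same data over several links per message, which raises the total traffic carried by the network.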
