Local search to improve coordinate-based task mapping

A local search algorithm improves on task mapping algorithms for stencil communication patterns. The algorithm is shown to reduce total running time and running time variability. The improvement is shown to depend on the allocation algorithm used. The number of swaps made is shown to be reasonable in practice.

We present a local search strategy that improves the coordinate-based mapping of a parallel job's tasks to the MPI ranks of its allocation, with the aim of reducing network congestion and the job's communication time. The goal is to reduce the number of network hops between communicating pairs of ranks. We target applications with a nearest-neighbor stencil communication pattern running on mesh systems with non-contiguous processor allocation, such as Cray XE and XK systems. Using the miniGhost mini-app, which models the shock physics application CTH, we demonstrate that our strategy reduces application running time while also reducing run-time variability. We further show that mapping quality can vary with the selected allocation algorithm, even between allocation algorithms of similar apparent quality.
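The abstract does not spell out the swap rule, so the following is only a minimal sketch of a swap-based local search of this general kind, not the authors' implementation. It assumes a 2D nearest-neighbor stencil, node coordinates on a mesh without wraparound links, and total hop count (Manhattan distance summed over communicating pairs) as the objective. All function names (`stencil_edges`, `hops`, `total_hops`, `local_search`) are hypothetical.

```python
import itertools
import random


def stencil_edges(rows, cols):
    """Communicating task pairs of a 2D nearest-neighbor stencil (assumed pattern)."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            t = r * cols + c
            if c + 1 < cols:
                edges.append((t, t + 1))
            if r + 1 < rows:
                edges.append((t, t + cols))
    return edges


def hops(a, b):
    """Manhattan distance between two node coordinates (mesh, no wraparound assumed)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_hops(mapping, coords, edges):
    """Sum of hops over all communicating task pairs; mapping[task] is a node index."""
    return sum(hops(coords[mapping[u]], coords[mapping[v]]) for u, v in edges)


def local_search(mapping, coords, edges):
    """Swap the nodes of two tasks whenever that strictly lowers total hops.

    Returns the improved mapping and the number of swaps performed.
    """
    mapping = list(mapping)
    neighbors = {t: [] for t in range(len(mapping))}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    def cost_of(t):
        # Hops on the edges incident to task t under the current mapping.
        return sum(hops(coords[mapping[t]], coords[mapping[n]]) for n in neighbors[t])

    swaps = 0
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(range(len(mapping)), 2):
            before = cost_of(a) + cost_of(b)
            mapping[a], mapping[b] = mapping[b], mapping[a]
            after = cost_of(a) + cost_of(b)
            if after < before:
                swaps += 1
                improved = True
            else:
                # Undo a swap that does not help.
                mapping[a], mapping[b] = mapping[b], mapping[a]
    return mapping, swaps


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical non-contiguous allocation: 16 nodes scattered over an 8x8x8 mesh.
    mesh = list(itertools.product(range(8), repeat=3))
    coords = random.sample(mesh, 16)
    edges = stencil_edges(4, 4)
    start = list(range(16))
    result, swaps = local_search(start, coords, edges)
    print(total_hops(start, coords, edges), total_hops(result, coords, edges), swaps)
```

Only the edges incident to the two swapped tasks are re-evaluated, so each candidate swap costs time proportional to the stencil degree rather than to the job size. Because a swap is kept only when it strictly lowers an integer objective, the loop always terminates, though only at a local optimum.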
