Toward an evolutionary task parallel integrated MPI + X programming model

The Bulk Synchronous Parallel programming model is showing performance limitations at high processor counts. We propose over-decomposition of the domain, operated on as tasks, to smooth out utilization of the computing resources, in particular the node interconnect and processing cores, and to hide intra- and inter-node data movement. Our approach maintains the existing coding style commonly employed in computational science and engineering applications. Although we show improved performance on existing computers, up to 131,072 processor cores, the effectiveness of this approach on expected future architectures will require the continued evolution of capabilities throughout the codesign stack. Success would then not only decrease time to solution, but also make better use of the hardware capabilities and reduce power and energy requirements, while fundamentally maintaining the current code configuration strategy.
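The sketch below illustrates the general idea in C, assuming MPI + OpenMP tasks as the "X": the local domain is over-decomposed into chunks, the boundary exchange is posted with nonblocking MPI, interior chunks are updated as tasks while messages are in flight, and boundary chunks are finished once the exchange completes. It is not taken from the paper (which uses a 3D stencil miniapp with a more elaborate boundary exchange); all identifiers such as NCHUNKS and update_chunk are illustrative.

```c
/* Minimal sketch of over-decomposition with MPI + OpenMP tasks.
 * Assumed build: mpicc -fopenmp overdecomp.c -o overdecomp
 */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NCHUNKS 16      /* over-decomposition factor per rank */
#define CHUNK   4096    /* cells per chunk */

/* Stand-in for a stencil update on one chunk of the local domain. */
static void update_chunk(double *u, int c)
{
    for (int i = c * CHUNK; i < (c + 1) * CHUNK; ++i)
        u[i] = 0.5 * (u[i] + 1.0);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double *u = calloc((size_t)NCHUNKS * CHUNK, sizeof *u);
    double ghost_left, ghost_right;
    MPI_Request req[4];

    /* Post the boundary exchange first (nonblocking)... */
    MPI_Irecv(&ghost_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&ghost_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[0],                   1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[NCHUNKS * CHUNK - 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* ...then update the interior chunks as independent tasks while
     * the messages are in flight. */
    #pragma omp parallel
    #pragma omp single
    {
        for (int c = 1; c < NCHUNKS - 1; ++c) {
            #pragma omp task firstprivate(c)
            update_chunk(u, c);
        }
        #pragma omp taskwait
    }

    /* Boundary chunks are updated only after the exchange completes
     * (the stand-in kernel above ignores the received ghost values). */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_chunk(u, 0);
    update_chunk(u, NCHUNKS - 1);

    free(u);
    MPI_Finalize();
    return 0;
}
```

Raising NCHUNKS increases the amount of interior work available to hide communication latency, at the cost of more task scheduling overhead and smaller per-task working sets.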
