Hybrid MPI-thread parallelization of adaptive mesh operations

Highlights: development of a hybrid MPI-thread programming system called PCU; inter-thread message passing, including non-blocking collectives; a novel, scalable termination detection technique for communication rounds; hybrid parallel scalability to 16K cores on an IBM Blue Gene/Q.

Many of the world's leading supercomputer architectures are hybrids of shared memory and network-distributed memory. Such an architecture lends itself to a hybrid MPI-thread programming model. We first present an implementation of inter-thread message passing based on the MPI and pthread libraries. In addition, we present an efficient implementation of termination detection for communication rounds; we use the term phased message passing to denote the communication interface built on this termination detection. This interface is then used to implement parallel operations for adaptive unstructured meshes, and the performance of the resulting applications is compared to pure MPI operation. We also present new workflows enabled by the ability to vary the number of threads at runtime.
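To make the idea of a phased communication round concrete, the sketch below shows one round in C with MPI, using a well-known non-blocking-consensus style of termination detection (synchronous-mode sends plus MPI_Ibarrier, in the spirit of Hoefler et al.'s NBX algorithm). This is a minimal illustration under that assumption, not the PCU implementation itself; the function phased_round and its arguments are hypothetical names. Each rank posts its sends, drains incoming messages, joins a non-blocking barrier once its own sends have been received, and declares the round over when the barrier completes, at which point every message sent in the round has been delivered.

/*
 * Hypothetical sketch of one phased communication round with scalable
 * termination detection (NBX-style). Not the PCU implementation; names
 * such as phased_round, dest, and payload are illustrative only.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Send one integer to each destination in dest[0..ndest-1], then keep
 * receiving until all ranks agree that the round has terminated. */
static void phased_round(const int *dest, const int *payload, int ndest,
                         MPI_Comm comm)
{
  MPI_Request *sends = malloc(ndest * sizeof(MPI_Request));
  for (int i = 0; i < ndest; ++i)
    /* Synchronous-mode send: completion implies the receiver has
     * started to receive this message. */
    MPI_Issend(&payload[i], 1, MPI_INT, dest[i], 0, comm, &sends[i]);

  MPI_Request barrier = MPI_REQUEST_NULL;
  int done = 0;
  while (!done) {
    /* Drain any message that has arrived so far. */
    int flag;
    MPI_Status st;
    MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &flag, &st);
    if (flag) {
      int value;
      MPI_Recv(&value, 1, MPI_INT, st.MPI_SOURCE, 0, comm, MPI_STATUS_IGNORE);
      printf("received %d from rank %d\n", value, st.MPI_SOURCE);
    }
    if (barrier == MPI_REQUEST_NULL) {
      /* Once all local sends have been received, join the non-blocking
       * barrier to announce that this rank is finished sending. */
      int all_sent;
      MPI_Testall(ndest, sends, &all_sent, MPI_STATUSES_IGNORE);
      if (all_sent)
        MPI_Ibarrier(comm, &barrier);
    } else {
      /* The round ends when every rank has entered the barrier, i.e.
       * all messages sent in this round have been delivered. */
      MPI_Test(&barrier, &done, MPI_STATUS_IGNORE);
    }
  }
  free(sends);
}

int main(int argc, char **argv)
{
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Toy round: each rank sends its rank number to its right neighbor. */
  int dest = (rank + 1) % size;
  int payload = rank;
  phased_round(&dest, &payload, 1, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}

This requires an MPI-3 implementation (for MPI_Ibarrier) and can be tried with, for example, "mpicc phased.c -o phased" followed by "mpirun -np 4 ./phased". How such a round is layered beneath threads acting as message-passing endpoints is beyond this sketch.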
