Integrating Asynchronous Task Parallelism with MPI

Effective combination of inter-node and intra-node parallelism is widely recognized as a major challenge for future extreme-scale systems. Many researchers have demonstrated the potential benefits of combining both levels of parallelism, including increased communication-computation overlap, improved memory utilization, and effective use of accelerators. However, current “hybrid programming” approaches often require significant rewrites of application code and assume a high level of programmer expertise. Dynamic task parallelism is widely regarded as a programming model that combines the best of performance and programmability for shared-memory programs, whereas distributed-memory programmers mostly rely on efficient implementations of MPI. In this paper, we propose HCMPI (Habanero-C MPI), an integration of the Habanero-C dynamic task-parallel programming model with the widely used MPI message-passing interface. All MPI calls are treated as asynchronous tasks in this model, thereby enabling unified handling of messages and tasking constructs. For programmers unfamiliar with MPI, we introduce distributed data-driven futures (DDDFs), a new data-flow programming model that seamlessly integrates intra-node and inter-node data-flow parallelism without requiring any knowledge of MPI. Our novel runtime design for HCMPI and DDDFs uses a combination of dedicated communication worker threads and computation-specific worker threads. We evaluate our approach on a set of micro-benchmarks as well as larger applications and demonstrate better scalability than the most efficient MPI implementations, while offering a unified programming model that integrates asynchronous task parallelism with distributed-memory parallelism.
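The dedicated-communication-worker design can be pictured with a short sketch. The code below is illustrative only and is not the HCMPI API: it uses plain MPI, pthreads, and C11 atomics, and all names (comm_task_t, post_send, comm_worker) are invented for this example. A computation thread posts a send as a task and keeps computing, while a single dedicated communication worker is the thread that drives MPI.

/*
 * Illustrative sketch only, not the HCMPI API. It mimics the runtime design
 * described above: computation threads post communication work to a queue,
 * and one dedicated communication worker drives MPI.
 */
#include <mpi.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* A pending send posted by a computation thread. */
typedef struct comm_task {
    const void *buf;
    int count, dest, tag;
    atomic_int done;            /* set by the communication worker on completion */
    struct comm_task *next;
} comm_task_t;

static comm_task_t *queue_head = NULL;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int shutting_down = 0;

/* Computation threads call this instead of calling MPI directly. */
static void post_send(comm_task_t *t) {
    pthread_mutex_lock(&queue_lock);
    t->next = queue_head;
    queue_head = t;
    pthread_mutex_unlock(&queue_lock);
}

/* Dedicated communication worker: the only thread that issues sends. */
static void *comm_worker(void *arg) {
    (void)arg;
    while (!atomic_load(&shutting_down)) {
        pthread_mutex_lock(&queue_lock);
        comm_task_t *t = queue_head;
        if (t) queue_head = t->next;
        pthread_mutex_unlock(&queue_lock);
        if (t) {
            MPI_Request req;
            MPI_Isend(t->buf, t->count, MPI_INT, t->dest, t->tag,
                      MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE); /* a real runtime polls many requests */
            atomic_store(&t->done, 1);
        }
    }
    return NULL;
}

int main(int argc, char **argv) {
    int provided, rank, size;
    /* MPI_THREAD_MULTIPLE keeps this sketch simple; funneling all MPI calls
       through one communication worker is what lets a runtime relax this. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    pthread_t comm;
    pthread_create(&comm, NULL, comm_worker, NULL);

    if (size >= 2 && rank == 0) {
        int payload = 42;
        comm_task_t t = { &payload, 1, /*dest=*/1, /*tag=*/0, 0, NULL };
        post_send(&t);                        /* asynchronous "send task" */
        /* ... overlap useful computation here ... */
        while (!atomic_load(&t.done)) { }     /* wait on the task, not on MPI */
        printf("rank 0: send task completed\n");
    } else if (size >= 2 && rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: received %d\n", value);
    }

    atomic_store(&shutting_down, 1);
    pthread_join(comm, NULL);
    MPI_Finalize();
    return 0;
}

Built with "mpicc -pthread" and run on two or more ranks, rank 0 overlaps its computation with the send. HCMPI obtains the same overlap by expressing the communication itself as an asynchronous task that other tasks can synchronize on.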

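Distributed data-driven futures generalize Habanero-C's single-assignment data-driven futures so that the producer and consumer of a value may sit on different ranks. As an intra-node analogue only (again not the Habanero-C or HCMPI API; future_t, future_put, and future_await are invented names), the sketch below implements the put/await contract with pthreads; a DDDF applies the same contract across ranks, with the runtime's communication worker moving the value.

/* Minimal single-node "data-driven future" sketch (hypothetical helpers,
 * not the Habanero-C/HCMPI API): one put, any number of awaiting consumers. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  filled;
    void *value;
    int   ready;              /* single-assignment: put() runs at most once */
} future_t;

static void future_init(future_t *f) {
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->filled, NULL);
    f->value = NULL;
    f->ready = 0;
}

static void future_put(future_t *f, void *v) {
    pthread_mutex_lock(&f->lock);
    f->value = v;
    f->ready = 1;
    pthread_cond_broadcast(&f->filled);   /* wake every awaiting consumer */
    pthread_mutex_unlock(&f->lock);
}

static void *future_await(future_t *f) {
    pthread_mutex_lock(&f->lock);
    while (!f->ready)
        pthread_cond_wait(&f->filled, &f->lock);
    void *v = f->value;
    pthread_mutex_unlock(&f->lock);
    return v;
}

static future_t result;

/* Consumer task: proceeds only after the future is satisfied. */
static void *consumer(void *arg) {
    (void)arg;
    int *v = (int *)future_await(&result);
    printf("consumer saw %d\n", *v);
    return NULL;
}

int main(void) {
    future_init(&result);
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    int payload = 7;
    future_put(&result, &payload);        /* producer satisfies the dependence */

    pthread_join(t, NULL);
    return 0;
}

The single-assignment discipline (at most one put per future) is what keeps the model deterministic and lets a runtime schedule a consumer task as soon as all of its input futures are satisfied, whether they are produced locally or on a remote rank.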