Easy Dataflow Programming in Clusters with UPC++ DepSpawn

The Partitioned Global Address Space (PGAS) programming model is one of the most relevant proposals for improving the ability of developers to exploit distributed memory systems. However, despite its important advantages over the traditional message-passing paradigm, PGAS has not yet been widely adopted. We think that PGAS libraries are more promising than PGAS languages because they avoid the need to (re)write applications in a new language, with the implied uncertainties about portability and interoperability with the vast number of APIs and libraries that exist for widespread languages. Nevertheless, the need to embed these libraries within a host language can limit their expressiveness, and very useful features can be missing. This paper contributes to the advancement of PGAS by enabling the simple development of arbitrarily complex task-parallel codes following a dataflow approach on top of UPC++, a PGAS library implemented in C++. In addition, our proposal, called UPC++ DepSpawn, relies on an optimized multithreaded runtime that provides very competitive performance, as our experimental evaluation shows.
