High-performance dataflow computing in hybrid memory systems with UPC++ DepSpawn

Dataflow computing is an attractive paradigm for high-performance computing because it triggers each computation as soon as its inputs are available. UPC++ DepSpawn is a task-based library that supports this model in hybrid shared/distributed memory systems on top of a Partitioned Global Address Space (PGAS) environment. While the initial version of the library provided good results, it suffered from a key restriction that heavily limited its performance and scalability: each process had to consider every task in the application rather than only those relevant to it, an overhead that grows with both the number of processes and the number of tasks. In this paper this restriction is lifted, enabling the library to reach higher levels of performance. In experiments on 768 cores, performance improved by up to 40.1%, with an average improvement of 16.1%.
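
To make the dataflow execution model concrete, the sketch below uses only standard C++ futures to show how a computation fires once its inputs become available. This is not UPC++ DepSpawn's API (the library derives dependencies automatically from task arguments rather than from explicit futures); it is only a minimal, self-contained illustration of the underlying idea.

// Minimal dataflow illustration in standard C++ (not the UPC++ DepSpawn API):
// each task can start as soon as the values it depends on are ready.
#include <future>
#include <iostream>

int add(int x, int y) { return x + y; }
int sub(int x, int y) { return x - y; }

int main() {
  // Independent producer tasks: they have no mutual dependencies,
  // so they may run concurrently.
  auto sum  = std::async(std::launch::async, add, 3, 4);
  auto diff = std::async(std::launch::async, sub, 3, 4);

  // Consumer task: it is launched right away, but it only proceeds
  // once both of its inputs (sum and diff) are available.
  auto prod = std::async(std::launch::async,
                         [&] { return sum.get() * diff.get(); });

  std::cout << "(3+4)*(3-4) = " << prod.get() << '\n';  // prints -7
  return 0;
}

In UPC++ DepSpawn the same pattern is expressed by spawning tasks on regular functions, with reads and writes (and thus the task graph) inferred from how the arguments are passed, which is what allows the runtime to schedule tasks across the processes of a PGAS application.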
