Evaluation of dataflow programming models for electronic structure theory

Funding information: US Air Force Office of Scientific Research, Grant/Award Number: FA9550-12-1-0476

Summary

Dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. In this paper, we evaluate different dataflow programming models for electronic structure methods and compare them in terms of programmability, resource utilization, and scalability. In particular, we evaluate two programming paradigms for expressing scientific applications in a dataflow form: (1) explicit dataflow, where the dataflow is specified explicitly by the developer, and (2) implicit dataflow, where a task scheduling runtime derives the dataflow using per-task data-access information embedded in a serial program. We discuss our findings and present a thorough experimental analysis using methods from the NWChem quantum chemistry application as our case study, and OpenMP, StarPU, and PaRSEC as the task-based runtimes that enable the different forms of dataflow execution. Furthermore, we derive an abstract model to explore the limits of the different dataflow programming paradigms.
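The difference between the two paradigms can be illustrated with a toy sketch. It is not the paper's implementation and does not use the OpenMP, StarPU, or PaRSEC APIs; the function derive_dataflow, the task names, and the data labels are all hypothetical. In the implicit form, tasks are listed in serial program order with their read and write sets, and a small stand-in for a runtime derives the dependency edges. In the explicit form, the developer writes the same edges down directly.

```python
from collections import defaultdict


def derive_dataflow(tasks):
    """Toy stand-in for an implicit-dataflow runtime (hypothetical helper).

    tasks: list of (name, reads, writes) tuples in serial insertion order.
    Returns the set of (predecessor, successor) dependency edges.
    """
    last_writer = {}             # datum -> most recent task that wrote it
    readers = defaultdict(list)  # datum -> tasks that read it since that write
    edges = set()
    for name, reads, writes in tasks:
        for d in reads:
            if d in last_writer:              # read-after-write
                edges.add((last_writer[d], name))
        for d in writes:
            if d in last_writer:              # write-after-write
                edges.add((last_writer[d], name))
            for r in readers[d]:              # write-after-read
                edges.add((r, name))
        # Update the bookkeeping only after all edges for this task are found.
        for d in reads:
            readers[d].append(name)
        for d in writes:
            last_writer[d] = name
            readers[d] = []
    return edges


# Implicit dataflow: a serial task list annotated with data accesses.
# Task names and data labels are made up for illustration.
serial_program = [
    ("gemm_1", {"A", "B"}, {"C1"}),
    ("gemm_2", {"A", "D"}, {"C2"}),
    ("reduce", {"C1", "C2"}, {"E"}),
]
implicit_edges = derive_dataflow(serial_program)

# Explicit dataflow: the developer states the same edges directly.
explicit_edges = {("gemm_1", "reduce"), ("gemm_2", "reduce")}

print(sorted(implicit_edges))
```

In this sketch both forms describe the same graph: the two independent multiplications may run concurrently, and the reduction waits for both. The trade-off the summary refers to is where the dependency information comes from, not what the resulting graph looks like.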
