Increasing the degree of parallelism using speculative execution in task-based runtime systems

Task-based programming models have demonstrated their efficiency in the development of scientific applications on modern high-performance platforms. They allow delegation of the management of parallelization to the runtime system (RS), which is in charge of the data coherency, the scheduling, and the assignment of the work to the computational units. However, some applications have a limited degree of parallelism such that no matter how efficient the RS implementation, they may not scale on modern multicore CPUs. In this paper, we propose using speculation to unleash the parallelism when it is uncertain if some tasks will modify data, and we formalize a new methodology to enable speculative execution in a graph of tasks. This description is partially implemented in our new C++ RS called SPETABARU, which is capable of executing tasks in advance if some others are not certain to modify the data. We study the behavior of our approach to compute Monte Carlo and replica exchange Monte Carlo simulations.

[1]  Laxmikant V. Kalé,et al.  CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.

[2]  Michel Cosnard,et al.  Proceedings of the 28th Annual Hawaii International Conference on System Sciences- 1995 Automatic Task Graph Generation Techniques , 2022 .

[3]  Michel Cosnard,et al.  Automatic task graph generation techniques , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.

[4]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[5]  Todd C. Mowry,et al.  The potential for using thread-level data speculation to facilitate automatic parallelization , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[6]  Sandhya Dwarkadas,et al.  Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference , 2002, Bioinform..

[7]  Holger H. Hoos,et al.  A replica exchange Monte Carlo algorithm for protein folding in the HP model , 2007, BMC Bioinformatics.

[8]  Jesús Labarta,et al.  A dependency-aware task-based programming environment for multi-core architectures , 2008, 2008 IEEE International Conference on Cluster Computing.

[9]  Gerhard Hummer,et al.  Coarse-grained models for simulations of multiprotein complexes: application to ubiquitin binding. , 2008, Journal of molecular biology.

[10]  Charles E. Leiserson,et al.  The Cilk++ concurrency platform , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[11]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[12]  Alejandro Duran,et al.  The Design of OpenMP Tasks , 2009, IEEE Transactions on Parallel and Distributed Systems.

[13]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[14]  Michael Bachmann,et al.  A GPU approach to parallel replica-exchange polymer simulations , 2011 .

[15]  Alejandro Duran,et al.  Ompss: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett..

[16]  Alexander Aiken,et al.  Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  Bruno Raffin,et al.  XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[18]  Xuebin Chi,et al.  Parallel Metropolis Coupled Markov Chain Monte Carlo for Isolation with Migration Model , 2013 .

[19]  Thomas Hérault,et al.  PTG: An Abstraction for Unhindered Parallelism , 2014, 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing.

[20]  Martin Tillenius,et al.  SuperGlue: A Shared Memory Framework Using Data Versioning for Dependency-Aware Task-Based Parallelization , 2015, SIAM J. Sci. Comput..

[21]  Cong Yan,et al.  A scalable architecture for ordered parallelism , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22]  Emmanuel Agullo,et al.  Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems , 2016, ACM Trans. Math. Softw..

[23]  Emmanuel Agullo,et al.  Task‐based FMM for heterogeneous architectures , 2016, Concurr. Comput. Pract. Exp..

[24]  Shantenu Jha,et al.  RepEx: A Flexible Framework for Scalable Replica Exchange Molecular Dynamics Simulations , 2016, 2016 45th International Conference on Parallel Processing (ICPP).

[25]  Emmanuel Agullo,et al.  Bridging the Gap Between OpenMP and Task-Based Runtime Systems for the Fast Multipole Method , 2017, IEEE Transactions on Parallel and Distributed Systems.

[26]  Manuel Selva,et al.  APOLLO: Automatic speculative POLyhedral Loop Optimizer , 2017 .

[27]  Manuel Selva,et al.  Full runtime polyhedral optimizing loop transformations with the generation, instantiation, and scheduling of code‐bones , 2017, Concurr. Comput. Pract. Exp..

[28]  Daniele Cesarini,et al.  Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking , 2018, IEEE Transactions on Parallel and Distributed Systems.

[29]  Kostas Katrinis,et al.  A taxonomy of task-based parallel programming technologies for high-performance computing , 2018, The Journal of Supercomputing.

[30]  Marc Casas,et al.  TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism , 2018, ISC.

[31]  Guido Araujo,et al.  Using Hardware-Transactional-Memory Support to Implement Thread-Level Speculation , 2018, IEEE Transactions on Parallel and Distributed Systems.