An Introduction to hpxMP - A Modern OpenMP Implementation Leveraging an Asynchronous Many-Task System

Asynchronous Many-Task (AMT) runtime systems have gained increasing acceptance in the HPC community due to the performance improvements offered by fine-grained tasking runtimes. At the same time, C++ standardization efforts are focused on creating higher-level interfaces able to replace OpenMP or OpenACC in modern C++ codes. These higher-level interfaces have been adopted in standards-conforming runtime systems such as HPX, giving users the ability to express fork-join parallelism directly in their own codes. Despite these innovations in runtime systems and standardization efforts, users face enormous challenges porting legacy applications. Not only must users port their own codes, but they often rely on highly optimized libraries such as BLAS and LAPACK, which use OpenMP for parallelization. Current efforts to create smooth migration paths have struggled with these challenges, especially since the threading systems of AMT libraries often compete with the threading system of OpenMP. To overcome these issues, our team has developed hpxMP, an implementation of the OpenMP standard that utilizes the underlying AMT system to schedule and manage tasks. This approach leverages the C++ interfaces exposed by HPX and allows users to execute their applications on an AMT system without changing their code. In this work, we compare hpxMP with Clang’s OpenMP library using four linear algebra benchmarks of the Blaze C++ library. While hpxMP is often not able to reach the same performance, we demonstrate its viability for providing a smooth migration path for applications, although it must be extended to benefit from a more general task-based programming model.
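To make the migration idea concrete, consider a minimal sketch (not taken from the paper; the daxpy kernel, array sizes, and build details are illustrative assumptions). Under hpxMP, a standard OpenMP worksharing loop such as the following is left untouched at the source level; only the runtime library implementing the pragma changes, so the loop iterations end up scheduled as lightweight HPX tasks rather than on OpenMP's own thread pool:

    // daxpy.cpp -- unmodified OpenMP code (illustrative kernel, not one of
    // the paper's Blaze benchmarks)
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::size_t const n = 1000000;
        std::vector<double> x(n, 1.0), y(n, 2.0);
        double const a = 3.0;

        // A standard OpenMP worksharing loop; linked against hpxMP instead
        // of libomp, the iterations run as HPX lightweight tasks.
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        std::printf("y[0] = %f\n", y[0]);
    }

The higher-level C++ interfaces mentioned above express the same computation directly against HPX's standards-conforming parallel algorithms. The sketch below assumes a recent HPX release in which the algorithms and execution policies live in the hpx and hpx::execution namespaces:

    // The same computation written against HPX's C++ standard parallelism
    // interface. A sketch; header and namespace layout differ across HPX
    // versions.
    #include <hpx/hpx_main.hpp>    // bootstraps the HPX runtime around main()
    #include <hpx/algorithm.hpp>   // hpx::transform
    #include <hpx/execution.hpp>   // hpx::execution::par

    #include <cstdio>
    #include <vector>

    int main()
    {
        std::size_t const n = 1000000;
        std::vector<double> x(n, 1.0), y(n, 2.0);
        double const a = 3.0;

        // Fork-join parallelism through a parallel execution policy,
        // mirroring std::transform(std::execution::par, ...).
        hpx::transform(hpx::execution::par, x.begin(), x.end(), y.begin(),
            y.begin(), [a](double xi, double yi) { return a * xi + yi; });

        std::printf("y[0] = %f\n", y[0]);
    }

Both versions describe the same fork-join computation; the point of hpxMP is that the first can run on the HPX scheduler without any source changes, while the second represents the eventual migration target.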
