DAOS for Extreme-scale Systems in Scientific Applications

Exascale I/O initiatives will require new, fully integrated I/O models that provide straightforward functionality, fault tolerance, and efficiency. One solution is the Distributed Asynchronous Object Storage (DAOS) technology, designed primarily for the next-generation NVRAM and NVMe technologies envisioned to provide a high-bandwidth/IOPS storage tier close to the compute nodes in an HPC system. In conjunction with DAOS, the HDF5 library, an I/O library for scientific applications, will support end-to-end data integrity, fault tolerance, object mapping, index building, and querying. This paper details the implementation of the HDF5 library over DAOS and evaluates its performance using three representative scientific application codes.
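
As a rough illustration only (not taken from the paper), the sketch below shows the application-facing side of such a stack: an ordinary parallel HDF5 write using the standard public API. The assumption, stated here rather than in the source, is that a DAOS-enabled HDF5 build (for example, through a DAOS storage plugin selected at file-access time) routes these same calls to DAOS object storage instead of a POSIX file, so application code need not change. File and dataset names are illustrative.

```c
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Standard HDF5 file-access setup. With a DAOS-backed HDF5 build
       (an assumption for this sketch), the same calls would be serviced
       by DAOS objects rather than a POSIX file system. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One small 2-D dataset; sizes are toy values for illustration. */
    hsize_t dims[2] = {128, 128};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    static double data[128][128];   /* zero-initialized payload */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    H5Pclose(fapl);

    MPI_Finalize();
    return 0;
}
```

The point of the sketch is the design choice the paper relies on: the HDF5 data model and API remain the interface to the application, while the object-storage backend beneath them changes.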
