DAOS and Friends: A Proposal for an Exascale Storage System

The DOE Extreme-Scale Technology Acceleration Fast Forward Storage and IO Stack project will have a significant impact on storage system design within and beyond the HPC community. With phase two of the project starting, this is an excellent opportunity to explore the complete design and how it will address the needs of extreme-scale platforms. This paper examines each layer of the proposed stack in detail, along with cross-cutting topics such as transactions and metadata management. It not only provides a timely summary of important aspects of the design specifications but also captures the underlying reasoning that is not available elsewhere. We encourage the broader community to understand the design, intent, and future directions in order to foster discussion guiding phase two and the ultimate production storage stack based on this work. An initial performance evaluation of the early prototype implementation is also provided to validate the presented design.