An Evolutionary Path to Object Storage Access

High-performance computing (HPC) storage systems typically consist of an object storage system accessed through the POSIX file interface. However, rapid increases in system scale and storage system complexity have exposed a number of limitations in this model. In particular, applications and libraries are limited in their ability to partition data into units with independent concurrency control, and mapping complex science data models onto the POSIX file model is inconvenient at best. In this paper, we propose an alternative interface that gives applications and libraries direct access to the underlying storage objects. This model allows applications and libraries to organize storage access around these objects, avoiding lock contention without needing to create many separate files. Additionally, complex data models are more readily organized into multiple object data streams, which simplifies the storage of variable-length data and allows the degree of parallelism to be chosen according to access needs. Our approach allows datasets stored in this new model to coexist with POSIX files, permitting gradual evolution to the new model over time. We apply these concepts in the PVFS, PLFS, and Parallel netCDF packages to prototype the model and describe our experiences.
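The following is a minimal in-memory Python sketch of the object-based idea. It is an illustration under stated assumptions, not the PVFS, PLFS, or Parallel netCDF interface from the paper. The `ObjectContainer` class and its methods are hypothetical. One logical dataset holds several named object streams, and each stream has its own lock. As a result, concurrent writers to different objects do not contend, and variable-length data can be appended to its own stream instead of being packed into slots of one shared file.

```python
import threading


class ObjectContainer:
    """Hypothetical container: one logical dataset backed by several
    independently lockable object streams (illustrative only, not a real API)."""

    def __init__(self):
        self._objects = {}                   # object name -> bytearray
        self._locks = {}                     # object name -> per-object lock
        self._meta_lock = threading.Lock()   # guards the set of objects only

    def create_object(self, name):
        with self._meta_lock:
            if name in self._objects:
                raise FileExistsError(name)
            self._objects[name] = bytearray()
            self._locks[name] = threading.Lock()

    def write(self, name, offset, data):
        # Only the target object's lock is taken, so writers to
        # different objects never wait on each other.
        with self._locks[name]:
            buf = self._objects[name]
            end = offset + len(data)
            if end > len(buf):
                buf.extend(b"\0" * (end - len(buf)))  # zero-fill any hole
            buf[offset:end] = data

    def append(self, name, data):
        # Variable-length records grow their own stream; no fixed-size
        # slots have to be preallocated in a shared file. Returns the offset.
        with self._locks[name]:
            buf = self._objects[name]
            offset = len(buf)
            buf.extend(data)
            return offset

    def read(self, name, offset, length):
        with self._locks[name]:
            return bytes(self._objects[name][offset:offset + length])

    def list_objects(self):
        with self._meta_lock:
            return sorted(self._objects)


# Usage: a dense variable and a variable-length one in separate streams.
c = ObjectContainer()
c.create_object("temperature")
c.create_object("particles")


def writer(rank):
    # Each rank writes its own fixed-size block of the dense variable...
    c.write("temperature", rank * 4, bytes([rank]) * 4)
    # ...and appends a record of rank-dependent length to a separate stream.
    c.append("particles", bytes([rank]) * (rank + 1))


threads = [threading.Thread(target=writer, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(c.list_objects())
print(len(c.read("particles", 0, 100)))  # 10 bytes: 1 + 2 + 3 + 4
```

In a real parallel file system, per-object concurrency control would be enforced by the storage servers rather than by in-process locks. The sketch uses `threading.Lock` only to make the "independent concurrency control per object" idea concrete.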
