MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers

Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. Their aggregate storage bandwidth grows linearly with system node count. However, although scientific applications can achieve scalable write bandwidth by having each process write to its node-local burst buffer, metadata challenges remain formidable, especially for files shared across many processes. This is due to the need to track and organize file segments across the distributed burst buffers in a global index. Because this global index can be accessed concurrently by thousands or more processes in a scientific application, the scalability of metadata management is a severe performance-limiting factor. In this paper, we propose MetaKV: a key-value store that provides fast and scalable metadata management for HPC metadata workloads on distributed burst buffers. MetaKV complements the functionality of an existing key-value store with specialized metadata services that efficiently handle bursty and concurrent metadata workloads: compressed storage management, supervised block clustering, and log-ring based collective message reduction. Our experiments demonstrate that MetaKV outperforms the state-of-the-art key-value stores by a significant margin. It improves put and get metadata operations by as much as 2.66× and 6.29×, respectively, and the benefits of MetaKV increase with increasing metadata workload demand.

[1]  Song Jiang,et al.  IOrchestrator: Improving the Performance of Multi-node I/O Systems via Inter-Server Coordination , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[2]  Patrick E. O'Neil,et al.  The log-structured merge-tree (LSM-tree) , 1996, Acta Informatica.

[3]  Sandeep K. S. Gupta,et al.  DASH: a Recipe for a Flash-based Data Intensive Supercomputer , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Karsten Schwan,et al.  Six degrees of scientific data: reading patterns for extreme scale science IO , 2011, HPDC '11.

[5]  Wang Teng,et al.  An Ephemeral Burst-Buffer File System for Scientific Applications , 2016 .

[6]  Jianwei Li,et al.  Parallel netCDF: A High-Performance Scientific I/O Interface , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[7]  John Bent,et al.  MDHIM: A Parallel Key/Value Framework for HPC , 2015, HotStorage.

[8]  Prabhat,et al.  FastBit: interactively searching massive data , 2009 .

[9]  John Bent,et al.  PLFS: a checkpoint filesystem for parallel applications , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[10]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[11]  Kai Ren,et al.  IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[12]  Lofstead Jay,et al.  DAOS and Friends: A Proposal for an Exascale Storage System , 2016 .

[13]  Rob VanderWijngaart,et al.  NAS Parallel Benchmarks I/O Version 2.4. 2.4 , 2002 .

[14]  Karsten Schwan,et al.  Adaptable, metadata rich IO methods for portable high performance IO , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[15]  Carlos Maltzahn,et al.  I/O acceleration with pattern detection , 2013, HPDC.

[16]  Scott Klasky,et al.  DataSpaces: an interaction and coordination framework for coupled simulation workflows , 2012, HPDC '10.

[17]  Ross Miller,et al.  Comparative I/O workload characterization of two leadership class storage clusters , 2015, PDSW '15.

[18]  John Shalf,et al.  Using IOR to analyze the I/O Performance for HPC Platforms , 2007 .

[19]  Samuel Lang,et al.  GIGA+: scalable directories for shared file systems , 2007, PDSW '07.

[20]  George Bosilca,et al.  Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology , 2007, ISPA.

[21]  Margo I. Seltzer,et al.  Berkeley DB , 1999, USENIX Annual Technical Conference, FREENIX Track.

[22]  Ke Wang,et al.  ZHT: A Light-Weight Reliable Persistent Dynamic Scalable Zero-Hop Distributed Hash Table , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[23]  Tony Tung,et al.  Scaling Memcache at Facebook , 2013, NSDI.

[24]  Jeffrey S. Vetter,et al.  Performance characterization and optimization of parallel I/O on the Cray XT , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[25]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[26]  Scott A. Brandt,et al.  Dynamic Metadata Management for Petabyte-Scale File Systems , 2004, Proceedings of the ACM/IEEE SC2004 Conference.