Partial Replication of Metadata to Achieve High Metadata Availability in Parallel File Systems

This paper presents PARTE, a prototype parallel file system with metadata servers (MDSs) in an active/standby configuration. PARTE replicates and distributes part of each file's metadata to the corresponding metadata stripes on the storage servers (OSTs) at per-file granularity, while the client file system (client) keeps a backup of certain metadata requests it has sent. If the active MDS crashes, the standby MDS replays these client backup requests to restore the lost metadata. If one or more backup requests are lost because of network problems or dead clients, the latest metadata saved in the associated metadata stripes is used to construct consistent and up-to-date metadata on the standby MDS. Moreover, clients and OSTs in PARTE can work in both normal mode and recovery mode. This differs from conventional parallel file systems with active/standby MDSs, which suspend all I/O and metadata requests while the lost metadata is being restored. In PARTE, previously connected clients can continue to perform I/O operations and the relevant metadata operations, because the OSTs act as temporary MDSs during that period, using the replicated metadata in the relevant metadata stripes. Experimental results show the feasibility of the main ideas presented in this paper for providing a highly available metadata service with only a slight overhead on I/O performance. Furthermore, since previously connected clients never hang during metadata recovery, in contrast to conventional systems, PARTE can achieve better overall I/O throughput.
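The recovery path described above (replay the client backup requests, and fall back to the metadata stripes when requests are missing) can be illustrated with a minimal Python sketch. It is not PARTE's implementation. The per-file sequence numbers, the `Request` and `StripeCopy` types, the `recover` function, and the idea that the standby MDS holds a versioned checkpoint are all assumptions introduced here for illustration; the abstract does not specify how lost requests are detected or how stripe metadata is versioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A metadata request kept by a client (hypothetical shape)."""
    seq: int       # assumed per-file sequence number
    path: str
    attr: str
    value: object


@dataclass
class StripeCopy:
    """Partial metadata replicated to an OST metadata stripe (hypothetical shape)."""
    version: int   # assumed: seq of the last request reflected in the stripe
    attrs: dict


def recover(checkpoint, backup_requests, stripes):
    """Rebuild per-file metadata on the standby MDS.

    checkpoint: path -> (version, attrs), the last state the standby knows.
    backup_requests: requests collected from the surviving clients.
    stripes: path -> StripeCopy read back from the OSTs.
    """
    paths = set(checkpoint) | {r.path for r in backup_requests} | set(stripes)
    recovered = {}
    for path in sorted(paths):
        version, attrs = checkpoint.get(path, (0, {}))
        attrs = dict(attrs)
        pending = sorted(
            (r for r in backup_requests if r.path == path and r.seq > version),
            key=lambda r: r.seq,
        )
        # Step 1: replay client backup requests while they are contiguous.
        for req in pending:
            if req.seq != version + 1:
                break  # a request was lost (network problem or dead client)
            attrs[req.attr] = req.value
            version = req.seq
        # Step 2: if the stripe holds newer metadata than replay could reach,
        # adopt the stripe copy instead.
        stripe = stripes.get(path)
        if stripe is not None and stripe.version > version:
            version, attrs = stripe.version, dict(stripe.attrs)
        recovered[path] = (version, attrs)
    return recovered


# Example: request 2 for /b was lost, so /b is restored from its stripe.
checkpoint = {"/a": (1, {"size": 10}), "/b": (1, {"size": 5})}
requests = [
    Request(2, "/a", "size", 20),
    Request(3, "/a", "size", 30),
    Request(3, "/b", "size", 9),
]
stripes = {
    "/a": StripeCopy(3, {"size": 30}),
    "/b": StripeCopy(3, {"size": 9}),
}
result = recover(checkpoint, requests, stripes)
print(result)
```

In this sketch `/a` is rebuilt purely by replay, while the gap in the requests for `/b` forces the fallback to the stripe copy. A real system would also need to handle duplicate requests from several clients and to serve I/O from the OSTs while this recovery runs; both are left out here.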
