ClusterRAID - architecture and prototype of a distributed fault-tolerant mass storage system for clusters

During the past few years, clusters built from commodity off-the-shelf (COTS) components have emerged as the predominant supercomputer architecture. Typically comprising a collection of standard PCs or workstations and an interconnection network, they have replaced the traditional integrated systems because of their better price/performance ratio. As the emphasis shifts from purely compute-intensive to I/O-intensive applications, mass storage for cluster installations is becoming an increasingly crucial aspect of these systems. The inherent unreliability of the underlying components is one reason why no system has yet established itself as a standard storage solution for clusters. This thesis presents the architecture and prototype implementation of a novel distributed mass storage system for COTS clusters that addresses the unreliability of the constituent components. The key concept of the system is to convert the local hard disk drive of each cluster node into a reliable device while preserving the block device interface. By deploying erasure-correcting codes, the system allows the number of tolerable failures, and thus the overall reliability, to be adjusted. In addition, the data layout takes the access behaviour of a broad range of applications into account and minimizes the number of required network transactions. Extensive measurements and functionality tests of the prototype, both stand-alone and in conjunction with local or distributed file systems, demonstrate the validity of the concept.
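The idea of protecting node-local data with an erasure code can be illustrated with the simplest such code, single XOR parity. The following is a minimal sketch under stated assumptions, not the scheme used in the thesis: XOR parity tolerates only one failure, whereas the system described above uses more general codes so that the number of tolerable failures can be adjusted. The three-node layout, the separate redundancy node, and the function names `xor_blocks`, `update_parity`, and `recover` are hypothetical and chosen for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def update_parity(parity, old_block, new_block):
    """Fold a local write into the remote parity block.

    Only the old and new contents of the written block are needed,
    so the other data nodes are not contacted.
    """
    return xor_blocks([parity, old_block, new_block])


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])


# Hypothetical three-node cluster: each node keeps its own data locally,
# so reads never leave the node.
node_data = [b"node0---", b"node1---", b"node2---"]
parity = xor_blocks(node_data)  # assumed to live on a separate redundancy node

# Node 1 overwrites its block; the parity is updated from the delta alone.
new_block = b"NODE1new"
parity = update_parity(parity, node_data[1], new_block)
node_data[1] = new_block

# Node 1 fails; its block is reconstructed from the others plus the parity.
rebuilt = recover([node_data[0], node_data[2]], parity)
assert rebuilt == new_block
```

In this sketch a read touches only the local disk, and a write costs one additional transfer to the node holding the parity, which is the kind of trade-off a layout aimed at few network transactions has to balance.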
