Paper information - ClusterRAID - architecture and prototype of a distributed fault-tolerant mass storage system for clusters

During the past few years, clusters built from commodity off-the-shelf (COTS) components have emerged as the predominant supercomputer architecture. Typically comprising a collection of standard PCs or workstations and an interconnection network, they have replaced the traditionally used integrated systems because of their better price/performance ratio. As the paradigm shifts from purely compute-intensive to I/O-intensive applications, mass storage solutions are becoming an increasingly crucial aspect of cluster installations. The inherent unreliability of the underlying components is one of the reasons why no system has yet been established as a standard storage solution for clusters. This thesis sets out the architecture and prototype implementation of a novel distributed mass storage system for COTS clusters and addresses the unreliability of the constituent components. The key concept of the presented system is to convert the local hard disk drive of a cluster node into a reliable device while preserving the block device interface. By deploying sophisticated erasure-correcting codes, the system allows the number of tolerable failures, and thus the overall reliability, to be adjusted. In addition, the applied data layout takes into account the access behaviour of a broad range of applications and minimizes the number of required network transactions. Extensive measurements and functionality tests of the prototype, both stand-alone and in conjunction with local or distributed file systems, demonstrate the validity of the concept.
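
The idea of protecting node-local disks with an erasure code can be illustrated with the simplest such code, single XOR parity, which tolerates exactly one failed node. This is only a minimal sketch and not the coding scheme of the thesis: the abstract describes codes with an adjustable number of tolerable failures, which plain XOR parity does not offer. The function names and the sample block contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_parity(data_blocks):
    """Compute one redundancy block over the data blocks of all nodes."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical example: one 8-byte block from each of three cluster nodes.
node_blocks = [b"node0dat", b"node1dat", b"node2dat"]
parity = encode_parity(node_blocks)

# Simulate the loss of node 1 and rebuild its block from the remaining data.
lost = 1
survivors = [block for i, block in enumerate(node_blocks) if i != lost]
recovered = reconstruct(survivors, parity)
```

In this sketch each node keeps its own data block unchanged, so reads need no network traffic; only the redundancy block has to be stored elsewhere and updated on writes. Tolerating more than one failure would require a stronger code with several redundancy blocks.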

Arne Wiebalck
