A Transactional Approach to Redundant Disk Array Implementation (CMU-CS-97-141)

Redundant disk arrays are a popular method of improving the dependability and performance of disk storage, and an ever-increasing number of array architectures are being proposed to balance cost, performance, and dependability. Despite their differences, these architectures have a great deal in common; unfortunately, current implementations appear unable to exploit this commonality effectively because of their ad hoc approach to error recovery. Such techniques rely upon a case-by-case analysis of errors, a manual process that is tedious and prone to mistakes. For each distinct error scenario, a unique procedure is implemented to remove the effects of the error and complete the affected operation. This form of recovery is not easily extended, because the analysis must be repeated as new array operations and architectures are introduced. Transaction-processing systems use logging techniques to mechanize the process of recovering from errors. However, the cost of guaranteeing that all operations can be undone from any point in their execution is too high to satisfy the performance and resource requirements of redundant disk arrays. This dissertation describes a novel programming abstraction and execution mechanism, based upon transactions, that simplifies implementation. Disk array algorithms are modeled as directed acyclic graphs: the nodes are actions such as "XOR" and the arcs represent data and control dependencies between them. Using this abstraction, we implemented eight array architectures in RAIDframe, a framework for prototyping disk arrays. Code reuse was consistently above 90%. The additional layers of abstraction did not affect the response time and throughput characteristics of RAIDframe; however, RAIDframe consumes 60% more CPU cycles than a hand-crafted non-redundant implementation.
RAIDframe employs roll-away error recovery, a novel scheme for mechanizing the execution of disk array algorithms without requiring that all actions be undoable. A barrier is inserted into each algorithm: failures prior to the barrier result in rollback, relying upon undo information. Once the barrier is crossed, the algorithm rolls forward to completion, and undo records are unnecessary. Experiments showed that this approach performs identically to non-logging schemes.
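
To make the graph abstraction and the barrier concrete, the following is a minimal Python sketch, not RAIDframe's actual code or API. It models a parity-protected small write as a directed acyclic graph whose nodes are actions (two reads, an XOR, two writes) and whose arcs are dependencies. All names (`Node`, `execute`, `small_write`, `fail_times`, the `"commit"` barrier node) are hypothetical. Two simplifications are assumed: the "disk" is a dictionary of integers, and rolling forward past the barrier is modeled as retrying the failed action, which may differ from what the dissertation does.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One action in the graph; `undo` is only needed before the barrier."""
    name: str
    do: Callable[[dict], None]
    undo: Optional[Callable[[dict], None]] = None
    deps: List[str] = field(default_factory=list)


def topo_order(nodes: Dict[str, Node]) -> List[str]:
    """Return node names so that every node follows its dependencies."""
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for dep in nodes[name].deps:
            visit(dep)
        order.append(name)

    for name in nodes:
        visit(name)
    return order


def execute(nodes: Dict[str, Node], state: dict,
            barrier: str = "commit", max_retries: int = 3) -> str:
    """Run the graph; roll back on failure before the barrier, forward after."""
    done: List[str] = []
    crossed = False
    for name in topo_order(nodes):
        attempts = 0
        while True:
            try:
                nodes[name].do(state)
                break
            except IOError:
                if not crossed:
                    # Before the barrier: undo completed actions in reverse.
                    for prev in reversed(done):
                        if nodes[prev].undo is not None:
                            nodes[prev].undo(state)
                    return "rolled_back"
                # After the barrier: no undo; retry (assumed roll-forward).
                attempts += 1
                if attempts > max_retries:
                    raise
        done.append(name)
        crossed = crossed or name == barrier
    return "completed"


def fail_times(action: Callable[[dict], None], n: int) -> Callable[[dict], None]:
    """Wrap an action so that its first n invocations raise IOError."""
    remaining = [n]

    def wrapped(state: dict) -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise IOError("injected fault")
        action(state)
    return wrapped


def small_write(disk: dict, new_data: int, faults: Optional[dict] = None):
    """Build and run the graph that overwrites block d0 and updates parity p."""
    faults = faults or {}

    def act(name, fn):
        return fail_times(fn, faults.get(name, 0))

    def forget(key):
        return lambda s: s.pop(key, None)  # undo: discard a scratch value

    def rd_data(s): s["old_data"] = disk["d0"]
    def rd_par(s): s["old_parity"] = disk["p"]
    def xor(s): s["new_parity"] = s["old_parity"] ^ s["old_data"] ^ new_data
    def wr_data(s): disk["d0"] = new_data
    def wr_par(s): disk["p"] = s["new_parity"]

    nodes = {
        "read_old_data": Node("read_old_data", act("read_old_data", rd_data),
                              forget("old_data")),
        "read_old_parity": Node("read_old_parity", act("read_old_parity", rd_par),
                                forget("old_parity")),
        "xor": Node("xor", act("xor", xor), forget("new_parity"),
                    deps=["read_old_data", "read_old_parity"]),
        "commit": Node("commit", lambda s: None, deps=["xor"]),  # the barrier
        "write_data": Node("write_data", act("write_data", wr_data),
                           deps=["commit"]),
        "write_parity": Node("write_parity", act("write_parity", wr_par),
                             deps=["commit"]),
    }
    state: dict = {}
    return execute(nodes, state), state
```

In this sketch, a call such as `small_write(disk, 0b1111, faults={"xor": 1})` fails before the barrier, so the scratch values are discarded and `disk` is left untouched, whereas `faults={"write_parity": 2}` fails after the barrier and the write is retried until the stripe is consistent again. Only the three pre-barrier nodes carry undo functions, which mirrors the abstract's point that undo records are unnecessary once the barrier is crossed.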
