Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress

Computer systems have achieved significant progress in technology, performance, capability, and RAS (reliability, availability, and serviceability) during the last quarter century. In this paper, we review the advances of IBM computer systems in the RAS area. This progress has for the most part been evolutionary; in some cases, however, it has been revolutionary. RAS developments have been driven primarily by technological advances and by increases in functional capability and complexity, but RAS considerations have also played a leading role and have themselves improved technological and functional capability. The paper briefly reviews the progress of computer technology. It points out how IBM has maintained or improved the RAS capabilities of its systems, despite greatly increased component counts and system complexity, through improved system recovery and serviceability as well as basic improvements in intrinsic component failure rates. The paper also covers the CPU, tape, and disk areas and shows that RAS improvements in these areas have been significant. The main objective is to provide a comprehensive view of significant developments in the RAS characteristics of IBM computer systems over the past twenty-five years.
