Paper information - THE STAR (SELF-TESTING-AND-REPAIRING) COMPUTER: AN INVESTIGATION OF THE THEORY AND PRACTICE OF FAULT

This paper presents an overview of the theoretical results and design experience obtained in a continuing investigation of fault tolerant computing which is being conducted at the Jet Propulsion Laboratory. Initial studies (1961-65) [1] led to the conclusion that dynamic [2] (also called "standby") redundancy offered the greatest promise in the design of fault tolerant digital computer systems. The dynamic redundancy approach requires a two-step procedure for the elimination of a fault: [...] determined; second, a corrective action is taken (e.g., replacement of failed unit, repetition of program, reconfiguration of systems, etc.). [...] the dynamic approach is static [2] ("masking") redundancy, which was already being utilized in existing component-redundant [3,4] and triple-modular redundant (TMR) [4,5,6] computers. The static method depends on the permanently connected structure of the computer to mask the occurrence of faults and is based on the assumption that faults are statistically independent events affecting single components or logic elements [6]. Early analytic studies of dynamic redundancy with idealized series-parallel system models [7,8,9,10] indicated that mean life gains of an order of magnitude and more over a non-redundant system could be expected from dynamically redundant systems with standby spares replacing failed units. This gain compared favorably with the mean life gain of less than 2 in the typical TMR systems. [...] of the dynamic over the static redundancy were [1,11]: [...]
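The mean-life comparison in the abstract can be illustrated with a minimal sketch. It assumes a textbook model that is not taken from the paper: every unit has the same constant failure rate, the TMR voter is perfect, fault detection and switchover are perfect, and standby spares cannot fail while unpowered. The function names and the failure rate below are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def mttf_simplex(failure_rate):
    """Mean life of one non-redundant unit with a constant failure rate."""
    return 1 / failure_rate


def mttf_tmr(failure_rate):
    """Mean life of an idealized TMR system (perfect voter, no repair).

    R(t) = 3*exp(-2*lam*t) - 2*exp(-3*lam*t); integrating over t >= 0
    gives 3/(2*lam) - 2/(3*lam) = 5/(6*lam).
    """
    return Fraction(3, 2) / failure_rate - Fraction(2, 3) / failure_rate


def mttf_standby(failure_rate, spares):
    """Mean life of one active unit backed by `spares` standby units.

    Assumes perfect detection and switchover, and spares that cannot fail
    while unpowered, so the individual unit lifetimes simply add up.
    """
    return (spares + 1) / failure_rate


def mean_life_gain(redundant_mttf, failure_rate):
    """Ratio of a redundant system's mean life to the simplex mean life."""
    return redundant_mttf / mttf_simplex(failure_rate)


lam = Fraction(1, 1000)  # hypothetical rate: one failure per 1000 hours
tmr_gain = mean_life_gain(mttf_tmr(lam), lam)
standby_gain = mean_life_gain(mttf_standby(lam, spares=9), lam)
print(f"TMR gain: {float(tmr_gain):.3f}")                     # 0.833
print(f"Standby gain with 9 spares: {float(standby_gain):.1f}")  # 10.0
```

Under these idealized assumptions a single TMR system has 5/6 of the simplex mean life, while nine ideal standby spares give a tenfold gain. The figures quoted in the abstract come from the paper's own models, which this sketch does not reproduce.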

[1] J. E. Long. To the outer planets, 1969.

[2] Betty J. Flehinger, et al. Reliability Improvement through Redundancy at Various System Levels, 1958, IBM J. Res. Dev.

[3] Algirdas Avizienis, et al. Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design, 1971, IEEE Transactions on Computers.

[4] Algirdas Avizienis, et al. Design of fault-tolerant computers, 1967, AFIPS '67 (Fall).

[5] F. P. Mathur, et al. Automatic maintenance of aerospace computers and spacecraft information and control systems, 1969.

[6] Algirdas Avizienis, et al. An experimental self-repairing computer, 1968, IFIP Congress.

[7] Algirdas Avizienis, et al. Reliability analysis and architecture of a hybrid-redundant digital system: generalized triple modular redundancy with self-repair, 1970, AFIPS '70 (Spring).

[8] Ralph E. Kuehn. Computer Redundancy: Design, Performance, and Future, 1969.

[9] Robert E. Lyons, et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability, 1962, IBM J. Res. Dev.

[10] W. C. Carter, et al. Reliability modeling techniques for self-repairing computer systems, 1969, ACM '69.

[11] Thomas B. Lewis. Primary Processor and Data Storage Equipment for the Orbiting Astronomical Observatory, 1963, IEEE Trans. Electron. Comput.