THE STAR (SELF-TESTING-AND-REPAIRING) COMPUTER: AN INVESTIGATION OF THE THEORY AND PRACTICE OF FAULT

This paper presents an overview of the theoretical results and design experience obtained in a continuing investigation of fault tolerant computing which is being conducted at the Jet Propulsion Laboratory. Initial studies (1961-65) [1] led to the conclusion that dynamic [2] (also called "standby") redundancy offered the greatest promise in the design of fault tolerant digital computer systems. The dynamic redundancy approach requires a two-step procedure for the elimination of a fault: first, the presence of a fault is determined; second, a corrective action is taken (e.g., replacement of failed unit, repetition of program, reconfiguration of systems, etc.). The alternative to the dynamic approach is static [2] ("masking") redundancy, which was already being utilized in existing component-redundant [3,4] and triple-modular redundant (TMR) [4,5,6] computers. The static method depends on the permanently connected structure of the computer to mask the occurrence of faults and is based on the assumption that faults are statistically independent events affecting single components or logic elements [6]. Early analytic studies of dynamic redundancy with idealized series-parallel system models [7,8,9,10] indicated that mean life gains of an order of magnitude and more over a non-redundant system could be expected from dynamically redundant systems with standby spares replacing failed units. This gain compared favorably with the mean life gain of less than 2 in the typical TMR systems. The advantages of the dynamic over the static redundancy were [1,11]: