Survivable systems

We consider here computer and communication systems with survivability requirements (i.e., systems that must continue to perform adequately in the face of various kinds of adversity). Fault-tolerant and nonstop (e.g., Tandem) systems are designed to survive specific types of hardware malfunctions. Secure systems are intended to withstand certain types of misuse--such as malicious denial-of-service attacks that can impair functional survivability or degrade performance. Survivable systems may need to be both fault tolerant and secure, e.g., if the perceived threats include hardware malfunction, malicious misuse, power failures, and electromagnetic (or other) interference. There may be hard real-time requirements as well.

We summarize several illustrative past problems (some well known to readers of "Risks"), suggesting the rather pervasive nature of the survivability problem--with many diverse causes and potential effects.

• On October 27, 1980, the ARPANET accidentally shut itself down globally. Collapse, analysis, and recovery took about four hours. The problem was due to a hardware design omission (no parity checking in memory), hardware failures (the coexistence of two bogus versions of a node status message resulting from memory errors), and generous algorithm design (overly permissive garbage collection). Memory overflowed in each network node and the network became useless.

• On January 15, 1990, the AT&T long-distance network suffered a nationwide blockage of most long-distance calling for 11 hours. The problem involved the implementation of a new recovery algorithm that caused Signaling System 7 switches to crash in response to the crash recovery of a neighboring switch. This crash phenomenon propagated repeatedly throughout the entire network, ping-ponging back and forth.

INSIDE RISKS