High-availability features

Session summary

High availability, sometimes referred to as fault tolerance, can be considered to comprise several classes of activities, e.g., fault detection, fault diagnosis, fault confinement, fault recovery, fault repair, fault reporting, and restart (if necessary). This topic has been the focus of much research in centralized computer systems, and more recently in the context of distributed systems such as networks and distributed computers. The strong interest in high availability and fault tolerance in distributed systems stems not just from their inherently greater fault susceptibility in certain ways (e.g., data inconsistency) but also from their potential for improved availability over centralized systems (e.g., physical isolation). However, in general the user of a distributed system takes neither of these perspectives: (s)he has a need which seems best filled by a distributed system, and must overcome availability obstacles and take advantage of availability opportunities.

In this session, four invited speakers addressed the topic from different viewpoints. The first two speakers presented projects where the principal mechanism for achieving fault tolerance is atomic transactions supported in the kernel. The third speaker discussed language-based tools for dynamic reconfiguration of distributed systems. The last speaker presented the fault-tolerance aspects of a network operating system based on actors, messages, and ports. These four presentations are briefly synopsized below.

ArchOS

E. Douglas Jensen (Carnegie-Mellon University, USA) outlined a large and long-term project performing research on "decentralized computers", in which a system-wide but physically replicated OS manages all the global resources through teams which negotiate, compromise, and reach a best-effort consensus based on inaccurate and incomplete information.
This is supported by a general atomic transaction facility in each instance of the kernel, which provides "compound" nonserializable transactions on distributed objects and "failure safety" (both of which are supported by a new formal theory of consistency and correctness), as well as the conventional nested serializable transactions and failure atomicity, which are special cases. Separate prototypes of the transaction kernel and a best-effort resource management kernel (initially confined to time-driven placement and scheduling of real-time processes) are expected to be operational on approximately ten Ethernet-connected Sun Microsystems nodes before the end of the year. Each node is a multiprocessor, to keep OS processing from burdening the application processor; special-purpose OS support hardware is being designed. A complete global decentralized operating system named ArchOS is taking the unusual approach (for a research project) of proceeding through all the …