Guest Editorial on Dependable Distributed Systems

With the continuing development of high-performance distributed systems, such as computational clusters and grids, greater levels of performance and cost effectiveness are being realized. New processor and computer system architectures provide the basis for achieving increased computational performance. High-speed network protocols and services reduce and hide the communication overhead. New algorithms and applications are tuned for these systems to achieve maximum efficiency, speedup, and scalability. Unfortunately, the ability to achieve higher levels of node, network, and system dependability can be an even greater challenge than attaining higher performance. As systems become larger and more complex, the challenge of attaining dependable distributed systems only escalates.
This special issue of the Cluster Computing journal focuses on critical issues in the design and analysis of dependable middleware and robust algorithms for distributed systems. While the four papers included span research activities and developments in both mission-critical environments and scientific computing, they are tied together by common threads such as efficiency of failure detection, overhead introduced by fault tolerance, and ease of use for the application (i.e., intrusiveness). The first three papers focus on the incorporation of fault tolerance into existing middleware frameworks that are already in common use. The final paper demonstrates the use of similar middleware to build a fault-tolerant algorithm.
The first paper featured is an invited paper, “MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware”, by Batchu, Dandass, Skjellum and Beddhu from Mississippi State University. This paper is based on the premise that the fault-tolerant behavior of message-passing middleware can be optimized based on application execution models. The authors propose a classification scheme for MPI-based applications, common in scientific computing, and present an MPI implementation that is tailored to the execution models proposed in that scheme. The experimental results demonstrate the promise of this approach to minimize the amount of overhead required to achieve fault tolerance in the message-passing layer.
The second paper, an invited paper entitled “Design and Implementation of a Pluggable Fault-Tolerant CORBA Infrastructure” by Zhao, Moser, and Melliar-Smith from the University of California at Santa Barbara, moves into the realm of CORBA. The authors explore the use of a pluggable protocols framework to introduce fault tolerance into CORBA applications. This approach allows them to embed fault tolerance in the ORB, thus reducing the difficulty of managing its state and minimizing the intrusion into the application. They describe their design and implementation, which is compliant with the FT CORBA specification, and they present experimental results that demonstrate its effectiveness.
The third paper, “Towards Real-Time Fault-Tolerant CORBA Middleware” by Gokhale, Natarajan, and Schmidt from Vanderbilt University and Cross from Lockheed Martin, extends the CORBA theme. In this paper, the authors delve into the complex relationship between predictability and dependability in mission-critical commercial or military distributed real-time embedded systems. In particular, they examine the deficiencies of conventional fault-tolerant CORBA implementations in real-time CORBA applications. The authors present an alternative based on semi-active replication, and their experimental results and analysis demonstrate the promise of this approach for real-time systems.
The final paper featured is “Component Object Based Single System Image for Dependable Implementation of Genetic Programming on Clusters” by Tanev of ATR Human Information Science Laboratories, Uozumi of Muroran Institute of Technology, and Akhmetov of Mobile Multimedia Business Headquarters. Rather than focusing solely on the middleware, as the previous papers have, these authors describe the use of dependable middleware to create a robust genetic programming algorithm. In particular, they use a distributed component-object model based on a single-system image, combined with the inherent parallelism of genetic programming algorithms, in order to achieve their goal. By taking this approach, the authors are able to construct an algorithm that suffers only very minor performance degradation in the event of a failure.
As the reader can surely imagine, any dependable system will require participation from the hardware architecture, the middleware, and the application. When combined with hardware features such as hot-swappable disks and redundant or reconfigurable network routing, there is no doubt that the middleware and algorithm work described in the four papers of this special issue represents essential contributions to the dependability of all aspects of a distributed system.