Guest Editors' Introduction to Special Section on Asynchronous Real-Time Distributed Systems

ASYNCHRONOUS real-time distributed systems are emerging in many domains, including defense, space, financial markets, autonomy and artificial intelligence, telecommunication, and industrial automation for real-time control above the device level. Such systems are fundamentally distinguished by the significant runtime uncertainties that are inherent in their application environment and system resource states. Another source of nondeterminism is that some events and state changes appear spontaneous to the computer system per se because their causes lie outside the system. Consequently, it is difficult to postulate, for such systems, upper bounds on application workloads or distributions of failure occurrences that will always be respected at runtime. Thus, they violate the deterministic foundations of hard real-time theory, which ensures that all timing constraints are always satisfied under deterministic postulations of application workloads, execution environment characteristics, and failure distributions. Asynchronous real-time distributed systems thus raise a fundamental, apparently contradictory, issue: “How to build timely systems that operate in the presence of uncertain timeliness?”
This special section presents papers that answer this question by focusing on different, but fundamental, problems in asynchronous real-time distributed computing systems. The section presents four papers that address fundamental problems, including uniform agreement, group communication, and group priority inversion. Furthermore, the section presents a paper that describes a generic architectural construct for asynchronous real-time distributed systems. From the papers, we find that two divergent schools of thought are emerging, each contradicting the other. The first school of thought is the “measure-compare-adapt” approach.
Mishra and Fetzer show how group communication services can be constructed, and Wang, Anceaume, Brasileiro, Greve, and Hurfin show how the group priority inversion problem can be solved, in asynchronous real-time distributed systems using the Timed Asynchronous (TA) model. Furthermore, Verissimo and Casimiro show how asynchronous real-time distributed systems can be built using the Timely Computing Base (TCB) architectural construct. The TA model and the TCB construct are based on the principle that uncertainty in asynchronous real-time distributed systems can be countered by postulating upper bounds on timing variables such as clock drift rates and end-to-end interprocess communication delays. Based on such postulates, and thus assuming a partially synchronous model, one can construct higher level services, such as group communication services or a network of TCB modules, that can detect when timing failures occur at runtime. Fundamental to this approach is the belief that such postulated upper bounds on timing variables are respected most of the time in asynchronous real-time distributed systems, but clearly, not always. Upon detection of a timing failure, one may employ some form of adaptation scheme to counter the failure. While Mishra and Fetzer are silent on how adaptation can be achieved, as their focus is on how the group communication service itself can be constructed, Verissimo and Casimiro propose the notion of coverage stability, which provides a framework for runtime adaptation. Wang, Anceaume, Brasileiro, Greve, and Hurfin present a protocol for solving the group priority inversion problem that occurs in real-time distributed systems that perform actively replicated processing based on static priorities. Group priority inversion is an extension of the priority inversion problem that was originally studied in the context of single processor systems. Their protocol assumes the TA model equipped with failure detectors.
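As a rough illustration of the measure-compare-adapt principle described above, the following Python sketch measures a message delay, compares it against a postulated upper bound, and relaxes the bound when too many timing failures are observed. It is not taken from any of the papers in this section: the class TimingFailureDetector, the function adapt, the bound-doubling policy, and the coverage threshold are all hypothetical, and the sketch assumes that send and receive timestamps share a common timebase, ignoring the clock drift that the TA model accounts for. The coverage figure computed here is a simple observed ratio and is not the coverage stability framework of Verissimo and Casimiro.

```python
from dataclasses import dataclass


@dataclass
class TimingFailureDetector:
    """Hypothetical monitor that flags a timing failure whenever an
    observed delay exceeds a postulated upper bound."""
    postulated_bound: float  # assumed upper bound on delay, in seconds
    observations: int = 0
    failures: int = 0

    def observe(self, send_time: float, receive_time: float) -> bool:
        """Measure one delay and compare it with the bound.
        Returns True if a timing failure was detected."""
        delay = receive_time - send_time          # measure
        self.observations += 1
        late = delay > self.postulated_bound      # compare
        if late:
            self.failures += 1
        return late

    def coverage(self) -> float:
        """Fraction of observations that respected the bound so far."""
        if self.observations == 0:
            return 1.0
        return 1.0 - self.failures / self.observations


def adapt(detector: TimingFailureDetector, min_coverage: float,
          factor: float = 2.0) -> bool:
    """Illustrative adaptation policy (an assumption, not from the source):
    if observed coverage drops below min_coverage, relax the bound by
    `factor` and restart the statistics. Returns True if adaptation ran."""
    if detector.coverage() < min_coverage:
        detector.postulated_bound *= factor
        detector.observations = 0
        detector.failures = 0
        return True
    return False
```

For example, with a postulated bound of 10 ms, a message delayed by 5 ms is accepted while one delayed by 20 ms is reported as a timing failure; if the resulting coverage falls below the chosen threshold, adapt doubles the bound. A real system would also have to bound the cost of this monitoring itself, which is one of the objections raised by the second school of thought.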
The second school of thought is the “no runtime adaptation, but guaranteed safety” paradigm. Hermant and Le Lann subscribe to this divergent philosophy. They believe that the “measure-compare-adapt” school of thought cannot help in improving timeliness guarantees. This is due to 1) runtime uncertainties that will cause postulated upper bounds on timing variables to be violated and, thus, the fail-aware property (which is used to detect timing failures) itself is lost and 2) the difficulty in conducting accurate schedulability analysis, which is exacerbated by the need to account for the overhead of the measure-compare-adapt techniques for performing runtime adaptation.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 51, NO. 8, AUGUST 2002 881