Got predictability?: experiences with fault-tolerant middleware

Unpredictability in COTS-based systems often manifests as occasional instances of uncontrollably high response times. A particular category of COTS systems, fault-tolerant (FT) middleware, is used in critical enterprise and embedded applications where predictability is of paramount importance. Our prior empirical study, which used a client-server microbenchmark, suggested that hard bounds for the maximum latency are difficult to establish a priori, but that the unpredictability may be confined to less than 1% of the requests. In this paper, we present empirical data, from 7 different three-tier, FT-middleware applications, that provides strong evidence supporting this "magical 1%" hypothesis. We conducted a controlled experiment with 7 teams of students from a graduate-level course at Carnegie Mellon University. Each team, starting from a common three-tier architecture, independently implemented and evaluated an original application using middleware (either CORBA or EJB) and a custom-implemented fault-tolerance mechanism (relying on either state-machine or primary-backup replication) for the middle-tier server. This experiment shows that unpredictability may not be avoidable, even in the absence of faults, and that, in some cases, the random latency outliers are larger than the time needed to recover from a fault. The data also reveals a statistically significant result: across all 7 applications, unpredictability is confined to the highest 1% of the recorded end-to-end latencies and is not correlated with the request rate, the size of the messages exchanged, or the number of clients. This suggests that strict predictability is hard to achieve in FT-middleware systems and that developers of critical FT applications should focus on guaranteeing bounds for statistical measures, such as the 99th percentile of the latency.
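As an illustration of the statistical bound the abstract recommends, the following sketch (not from the paper; the data values are hypothetical) computes a nearest-rank 99th percentile of end-to-end latencies. With outliers confined to the top 1% of samples, the 99th percentile stays near the typical latency even though the maximum is far larger:

```python
def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value such that at least p% of samples are <= it.
    import math
    k = math.ceil(p / 100.0 * len(ordered))  # 1-based rank
    return ordered[k - 1]

# Hypothetical workload: 99% of requests complete in ~10 ms,
# while 1% are random outliers fifty times slower.
latencies_ms = [10.0] * 990 + [500.0] * 10
print(percentile(latencies_ms, 99))   # → 10.0 (bound unaffected by the outliers)
print(max(latencies_ms))              # → 500.0 (worst case dominated by outliers)
```

This contrast between the 99th percentile and the maximum is what makes bounding a statistical measure tractable where a hard worst-case bound is not.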