Unpredictability in COTS-based systems often manifests as occasional instances of uncontrollably high response times. A particular category of COTS systems, fault-tolerant (FT) middleware, is used in critical enterprise and embedded applications where predictability is of paramount importance. Our prior empirical study, which used a client-server microbenchmark, suggested that hard bounds on the maximum latency are difficult to establish a priori, but that the unpredictability may be confined to less than 1% of the requests. In this paper, we present empirical data from 7 different three-tier FT-middleware applications that provides strong evidence for this "magical 1%" hypothesis. We conducted a controlled experiment with 7 teams of students from a graduate-level course at Carnegie Mellon University. Each team, starting from a common three-tier architecture, independently implemented and evaluated an original application using middleware (either CORBA or EJB) and a custom-implemented fault-tolerance mechanism (relying on either state-machine or primary-backup replication) for the middle-tier server. This experiment shows that unpredictability may not be avoidable, even in the absence of faults, and that, in some cases, the random latency outliers are larger than the time needed to recover from a fault. The data also reveal a statistically significant result: across all 7 applications, unpredictability is confined to the highest 1% of the recorded end-to-end latencies and is not correlated with the request rate, the size of the messages exchanged, or the number of clients. This suggests that strict predictability is hard to achieve in FT-middleware systems and that developers of critical FT applications should focus on guaranteeing bounds for statistical measures, such as the 99th percentile of the latency.
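To illustrate what bounding a statistical measure rather than the maximum latency could look like, the following minimal Python sketch computes the 99th percentile of a set of end-to-end latencies and compares it with the observed maximum. It is not the analysis used in the paper: the nearest-rank percentile definition, the function names `percentile` and `tail_summary`, and the synthetic latency values are all assumptions chosen for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def tail_summary(latencies_ms, p=99):
    """Summarize how far the worst-case latency lies above the p-th percentile."""
    bound = percentile(latencies_ms, p)
    above = [x for x in latencies_ms if x > bound]
    return {
        "p_bound": bound,                      # candidate statistical bound
        "max": max(latencies_ms),              # worst observed latency
        "outlier_fraction": len(above) / len(latencies_ms),
    }


# Synthetic (hypothetical) data: 995 requests near 1-2 ms plus 5 large outliers.
latencies = [1.0 + 0.001 * i for i in range(995)] + [50.0, 80.0, 120.0, 200.0, 400.0]
summary = tail_summary(latencies)
print(summary)
```

In this synthetic data set the maximum is dominated by a handful of outliers, while the 99th percentile stays close to the typical latency, which is the kind of gap that motivates guaranteeing percentile bounds instead of hard maxima.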