Fault tolerant internet computing: Benchmarking and modelling trade-offs between availability, latency and consistency

Abstract: This paper reports practical experience and theoretical results from investigating how consistency affects latency in distributed fault-tolerant systems built over the Internet and clouds. We introduce a time-probabilistic failure model of distributed systems that use the service-oriented paradigm to define cooperation with clients over the Internet and clouds. We examine the trade-offs between consistency, availability and latency, and the role of the application timeout as the main determinant of the interplay between system availability and responsiveness. The model relies heavily on collecting and analysing a large amount of data that captures the probabilistic behaviour of such systems. We present experimental measurements of response time in a distributed service-oriented system whose replicas are deployed in different Amazon EC2 location domains. The results show that stronger consistency increases system latency, in line with the qualitative implication of the well-known CAP theorem. Finally, we propose a set of novel mathematical models, based on statistical analysis of the collected data, that enable quantified prediction of response time as a function of the timeout setting and the level of consistency provided by the replicated system.
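The interplay described above between timeout, availability, latency and consistency can be illustrated with a minimal sketch. It is not the paper's model. It assumes that a consistency level can be approximated by waiting for the k fastest of n replica responses, and that a request counts as available only if it completes within the timeout. The function names and all response times below are hypothetical and are not taken from the paper's measurements.

```python
from statistics import mean


def kth_fastest(replica_times, k):
    """Latency of a request that waits for k replica responses (assumed model)."""
    return sorted(replica_times)[k - 1]


def evaluate(samples, k, timeout):
    """Return (availability, mean latency of responses that met the timeout)."""
    latencies = [kth_fastest(times, k) for times in samples]
    timely = [t for t in latencies if t <= timeout]
    availability = len(timely) / len(latencies)
    mean_latency = mean(timely) if timely else float("nan")
    return availability, mean_latency


# Hypothetical response times in milliseconds for three replicas per request.
# These are illustrative values, not data from the paper.
SAMPLES = [
    (120, 180, 450),
    (100, 210, 390),
    (140, 160, 900),
    (110, 600, 700),
]

if __name__ == "__main__":
    for k in (1, 2, 3):          # k = 1 is the weakest level, k = 3 the strongest
        for timeout in (200, 500):
            availability, latency = evaluate(SAMPLES, k, timeout)
            print(f"k={k} timeout={timeout}ms "
                  f"availability={availability:.2f} mean_latency={latency:.1f}ms")
```

On this toy data, waiting for more replicas raises the latency of each request, and a shorter timeout turns the slower requests into failures, which lowers availability. A longer timeout recovers availability at the cost of a higher mean response time.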
