An Analysis of Network-Partitioning Failures in Cloud Systems

We present a comprehensive study of 136 system failures attributed to network-partitioning faults in 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. Most of the failures can easily manifest once a network partition occurs: they require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network-fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms. The majority of the failures could have been avoided by design reviews and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures.
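To make the fault model concrete, the sketch below shows one way a complete partition that isolates a single node could be injected on Linux hosts using iptables. This is only an illustrative sketch, not NEAT's actual API or implementation; the node addresses, ssh-based command execution, and helper names are assumptions introduced for the example.

```python
# Illustrative sketch (not NEAT): isolate one node from the rest of the
# cluster by dropping traffic in both directions, then heal the partition.
# Assumes Linux hosts reachable over ssh with root access and iptables,
# and hypothetical node IP addresses.
import subprocess

CLUSTER = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses


def run_on(node, cmd):
    # Run a shell command on a remote node over ssh (assumes key-based auth).
    subprocess.run(["ssh", f"root@{node}", cmd], check=True)


def isolate(node, peers):
    # Drop all traffic between `node` and every other cluster member,
    # in both directions, so the partition is symmetric.
    for peer in peers:
        if peer == node:
            continue
        run_on(node, f"iptables -A INPUT -s {peer} -j DROP")
        run_on(node, f"iptables -A OUTPUT -d {peer} -j DROP")


def heal(node):
    # Flush the injected rules, restoring full connectivity.
    run_on(node, "iptables -F INPUT && iptables -F OUTPUT")


if __name__ == "__main__":
    target = CLUSTER[0]
    isolate(target, CLUSTER)  # inject the fault
    # ... drive client workloads against the system here ...
    heal(target)              # heal the partition and check recovery
```

Partial partitions (where two groups of nodes lose connectivity to each other but both can still reach a third group) and simplex partitions (traffic dropped in one direction only) can be expressed the same way by choosing which source/destination pairs and which chains the DROP rules apply to.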
