On fault resilience of OpenStack

Cloud-management stacks have become an increasingly important element in cloud computing, serving as the resource manager of cloud platforms. While the functionality of this emerging layer has been constantly expanding, its fault resilience remains under-studied. This paper presents a systematic study of the fault resilience of OpenStack, a popular open-source cloud-management stack. We have built a prototype fault-injection framework that targets service communications during the processing of external requests, both among OpenStack services and between OpenStack and external services, and have thus far uncovered 23 bugs in two versions of OpenStack. Our findings shed light on defects in the design and implementation of state-of-the-art cloud-management stacks from a fault-resilience perspective.
