Experience Report: On the Impact of Software Faults in the Privileged Virtual Machine

Cloud computing is revolutionizing how organizations treat computing resources. The privileged virtual machine is a key component in systems that use virtualization, but it poses a dependability risk for several reasons. The activation of residual software faults, which exist in every software project, is a real threat and can impact the correct operation of the entire virtualized system. To study this question, we begin with a detailed analysis of the privileged virtual machine and its components, followed by software fault injection campaigns that target two of those components: the toolstack and a device driver. The obstacles faced during this experimental phase, and how they were overcome, are described with practitioners in mind. The results show that software faults in those components can either have no impact or lead to drastic failures (for 4-9% of the faults), showing that the privileged virtual machine is a single point of failure that must be protected. Most of the failures are detectable by monitoring basic functionalities, but some faults caused inconsistent states that manifest only later. No silent data failures (SDF) have been observed, but the number of faults injected so far only allows us to conclude that SDF are not very frequent.
