An empirical study on configuration errors in commercial and open source systems

Configuration errors (i.e., misconfigurations) are among the dominant causes of system failures. Their importance has inspired many research efforts on detecting, diagnosing, and fixing misconfigurations; such research would benefit greatly from a real-world characteristic study on misconfigurations. Unfortunately, few such studies have been conducted in the past, primarily because historical misconfigurations usually have not been recorded rigorously in databases. In this work, we undertake one of the first attempts to conduct a real-world misconfiguration characteristic study. We study a total of 546 real world misconfigurations, including 309 misconfigurations from a commercial storage system deployed at thousands of customers, and 237 from four widely used open source systems (CentOS, MySQL, Apache HTTP Server, and OpenLDAP). Some of our major findings include: (1) A majority of misconfigurations (70.0%~85.5%) are due to mistakes in setting configuration parameters; however, a significant number of misconfigurations are due to compatibility issues or component configurations (i.e., not parameter-related). (2) 38.1%~53.7% of parameter mistakes are caused by illegal parameters that clearly violate some format or rules, motivating the use of an automatic configuration checker to detect these misconfigurations. (3) A significant percentage (12.2%~29.7%) of parameter-based mistakes are due to inconsistencies between different parameter values. (4) 21.7%~57.3% of the misconfigurations involve configurations external to the examined system, some even on entirely different hosts. (5) A significant portion of misconfigurations can cause hard-to-diagnose failures, such as crashes, hangs, or severe performance degradation, indicating that systems should be better-equipped to handle misconfigurations.

[1]  Wei Zheng,et al.  Automatic configuration of internet services , 2007, EuroSys '07.

[2]  Donald E. Porter,et al.  Improved error reporting for software that uses black-box components , 2007, PLDI '07.

[3]  Helen J. Wang,et al.  Strider: a black-box, state-based approach to change and configuration management and support , 2003, Sci. Comput. Program..

[4]  Rui Wang,et al.  Towards automatic reverse engineering of software security configurations , 2008, CCS.

[5]  Richard P. Martin,et al.  Understanding and Dealing with Operator Mistakes in Internet Services , 2004, OSDI.

[6]  Dina Katabi,et al.  Enabling Configuration-Independent Automation by Non-Expert Users , 2010, OSDI.

[7]  Ricardo Bianchini,et al.  Staged deployment in mirage, an integrated software upgrade testing and distribution system , 2007, SOSP.

[8]  Wei-Ying Ma,et al.  Automated known problem diagnosis with event traces , 2006, EuroSys.

[9]  Adam A. Porter,et al.  Using symbolic evaluation to understand behavior in configurable software systems , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[10]  Lorenzo Keller,et al.  ConfErr: A tool for assessing resilience to human configuration errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[11]  Robert W. Reeder,et al.  Improving user-interface dependability through mitigation of human error , 2005, Int. J. Hum. Comput. Stud..

[12]  Brendan Murphy,et al.  Measuring system and software reliability using an automated data collection process , 1995 .

[13]  Mona Attariyan,et al.  Using Causality to Diagnose Configuration Bugs , 2008, USENIX Annual Technical Conference.

[14]  Junfeng Yang,et al.  An empirical study of operating systems errors , 2001, SOSP.

[15]  Mark Sullivan,et al.  A comparison of software defects in database management systems and operating systems , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[16]  Patrick Goldsack,et al.  SmartFrog Meets LCFG: Autonomous Reconfiguration with Central Policy Control , 2003, LISA.

[17]  Ding Yuan,et al.  Improving Software Diagnosability via Log Enhancement , 2012, TOCS.

[18]  Mona Attariyan,et al.  AutoBash: improving configuration management with operating system causality analysis , 2007, SOSP.

[19]  Archana Ganapathi,et al.  Why Do Internet Services Fail, and What Can Be Done About It? , 2002, USENIX Symposium on Internet Technologies and Systems.

[20]  Helen J. Wang,et al.  Automatic Misconfiguration Troubleshooting with PeerPressure , 2004, OSDI.

[21]  Nick Feamster,et al.  Detecting BGP configuration faults with static analysis , 2005 .

[22]  Richard P. Martin,et al.  Barricade: defending systems against operator mistakes , 2010, EuroSys '10.

[23]  Noah Treuhaft,et al.  Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies , 2002 .

[24]  Jim Gray,et al.  Why Do Computers Stop and What Can Be Done About It? , 1986, Symposium on Reliability in Distributed Software and Database Systems.

[25]  David A. Patterson,et al.  Undo for Operators: Building an Undoable E-mail Store , 2003, USENIX Annual Technical Conference, General Track.

[26]  Randy H. Katz,et al.  Static extraction of program configuration options , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[27]  Mark Sullivan,et al.  Software defects and their impact on system availability-a study of field failures in operating systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[28]  Steven D. Gribble,et al.  Configuration Debugging as Search: Finding the Needle in the Haystack , 2004, OSDI.

[29]  Mona Attariyan,et al.  Automating Configuration Troubleshooting with Dynamic Information Flow Analysis , 2010, OSDI.

[30]  Richard P. Martin,et al.  Understanding and Validating Database System Administration , 2006, USENIX Annual Technical Conference, General Track.

[31]  Soudip Roy Chowdhury,et al.  Determining configuration parameter dependencies via analysis of configuration data from multi-tiered enterprise applications , 2009, ICAC '09.

[32]  Yuanyuan Zhou,et al.  Learning from mistakes: a comprehensive study on real world concurrency bug characteristics , 2008, ASPLOS.