Systems Approaches to Tackling Configuration Errors

In recent years, configuration errors (i.e., misconfigurations) have become one of the dominant causes of system failures, resulting in many severe service outages and downtime. Unfortunately, it is notoriously difficult for system users (e.g., administrators and operators) to prevent, detect, and troubleshoot configuration errors, owing to the complexity of both the configurations and the systems being configured. As a result, resolving configuration errors is often very costly, both in compensating for service disruptions and in diagnosing and recovering from the failures. Their prevalence, severity, and cost have made configuration errors one of the thorniest systems problems awaiting a solution. This survey article provides a holistic and structured overview of the systems approaches that tackle configuration errors. To understand the problem fundamentally, we first discuss the characteristics of configuration errors and the challenges of tackling them. Then, we discuss the state-of-the-art systems approaches that address different types of configuration errors in different scenarios. Our primary goal is to equip stakeholders with a better understanding of configuration errors and of the potential solutions for resolving them across the spectrum of system development and management. To inspire follow-up research, we further discuss open problems with regard to system configuration. To the best of our knowledge, this is the first survey on the topic of tackling configuration errors.
