A survey on self-healing systems: approaches and systems

Today's large-scale information technology environments are complex, heterogeneous compositions that often suffer from unpredictable behavior and poor manageability. This has fostered substantial research on designs and techniques that give these systems autonomous behavior. In this survey, we focus on the self-healing branch of that research and give an overview of the existing approaches. The survey opens with an outline of the origins of self-healing. Based on the principles of autonomic computing and self-adapting system research, we identify the fundamental principles of self-healing systems. These principles support our analysis of the collected approaches. In a final discussion, we summarize the approaches' common and individual characteristics. A comprehensive tabular overview of the researched material concludes the survey.
