Root cause analysis in IT infrastructures using ontologies and abduction in Markov Logic Networks

Information systems play a crucial role in most of today's business operations. High availability and reliability of services and hardware are essential, as are short response times when outages occur. Strong tool support and automation in risk management are therefore desirable to reduce downtime. We propose a new approach for calculating the root cause of an observed failure in an IT infrastructure. Our approach is based on abduction in Markov Logic Networks. Abduction aims to find an explanation for a given observation in the light of some background knowledge. In failure diagnosis, the explanation corresponds to the root cause, the observation to the failure of a component, and the background knowledge to the dependency graph extended by potential risks. Because abductive reasoning is not naturally supported in Markov Logic Networks, we apply a method that extends the network to make it possible. Our approach is highly reusable and facilitates modeling by using ontologies as background knowledge. This enables users without specific knowledge of a concrete infrastructure to gain useful insights in the case of an incident. We implemented the method in a tool and illustrate its suitability for root cause analysis by applying it to a sample scenario and testing its scalability on randomly generated infrastructures.
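
The mapping between abduction and failure diagnosis can be illustrated with a minimal sketch. It is not the method of the paper, which extends a Markov Logic Network and relies on probabilistic inference; it is a brute-force stand-in that enumerates candidate explanations and scores them additively. The component names, risk names, weights, and the functions `failed_components` and `explain` are all hypothetical and chosen only for illustration. A component is assumed to fail if a risk hits it directly or if any component it depends on has failed.

```python
from itertools import combinations

# Hypothetical dependency graph: component -> components it depends on.
DEPENDS_ON = {
    "webshop": ["app_server", "database"],
    "app_server": ["host_a"],
    "database": ["host_b"],
    "host_a": [],
    "host_b": [],
}

# Hypothetical risks: (risk, affected component) -> illustrative weight.
# A more negative weight stands for a less likely risk.
RISKS = {
    ("disk_failure", "host_b"): -2.0,
    ("power_outage", "host_a"): -3.0,
    ("software_bug", "app_server"): -1.5,
    ("software_bug", "database"): -2.5,
}


def failed_components(active_risks):
    """Propagate failures along the dependency graph until nothing changes."""
    failed = {component for _, component in active_risks}
    changed = True
    while changed:
        changed = False
        for component, deps in DEPENDS_ON.items():
            if component not in failed and any(d in failed for d in deps):
                failed.add(component)
                changed = True
    return failed


def explain(observed_down, observed_up=()):
    """Return the best-scoring set of risks consistent with the observations.

    A candidate explains the observations if every component in
    observed_down fails and no component in observed_up fails.
    The score is the sum of the weights of the assumed risks.
    """
    best, best_score = None, float("-inf")
    risks = list(RISKS)
    for size in range(1, len(risks) + 1):
        for candidate in combinations(risks, size):
            failed = failed_components(candidate)
            if not set(observed_down) <= failed:
                continue  # does not explain the observed failure
            if failed & set(observed_up):
                continue  # contradicts a component known to be running
            score = sum(RISKS[r] for r in candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    print(explain(["webshop"]))
    print(explain(["webshop"], observed_up=["app_server"]))
```

In this toy setting, observing only that `webshop` is down selects the software bug on `app_server` as the explanation. Adding the observation that `app_server` is still running rules that candidate out and shifts the explanation to the disk failure on `host_b`. Exhaustive enumeration grows exponentially with the number of risks, so it serves only to make the roles of observation, background knowledge, and explanation concrete.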
