Big Data architecture for IT incident management

IT incident management aims to restore normal service quality and availability of IT systems after interruptions. IT incidents often have complex causes rooted in an IT environment composed of thousands of interdependent components. Diagnosing an incident therefore requires collecting and analyzing large volumes of data about these components, often in real time, to identify suspected causes. Fulfilling this requirement with traditional techniques is extremely difficult. In this paper, we propose a new analysis architecture based on Big Data techniques. The architecture uses stream computing and MapReduce to analyze data from various sources, stores incident-related documents and their relationships in NoSQL databases, and applies further analytical techniques to these documents for root-cause analysis and failure prediction. We demonstrate this approach with a real-world example and present evaluation results from a recent pilot study.
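To make the division of labor concrete, here is a toy, single-process Python sketch, not the paper's implementation. A MapReduce-style map/reduce pair counts error events per component, an in-memory dictionary stands in for the NoSQL store of incident documents and their relationships, and a walk over dependency edges ranks suspect components by error count. The `Event` type, the `DocumentStore` class and its schema, the ranking heuristic, and all component names are hypothetical assumptions; stream computing and the paper's actual root-cause and failure-prediction analytics are not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    component: str  # hypothetical component ID, e.g. "db-01"
    kind: str       # hypothetical event type, e.g. "error" or "warning"


def map_phase(events):
    """Map step: emit a (component, 1) pair for every error event."""
    for e in events:
        if e.kind == "error":
            yield e.component, 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted values per key (component)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


class DocumentStore:
    """In-memory stand-in for a NoSQL store of incident documents
    and their relationships (hypothetical schema)."""

    def __init__(self):
        self.docs = {}                      # document ID -> document
        self.depends_on = defaultdict(set)  # component -> components it depends on

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc

    def relate(self, src, dst):
        self.depends_on[src].add(dst)


def suspect_causes(store, incident_id, error_counts):
    """Follow dependency edges from the incident's component and rank
    every reachable component that reported errors, most errors first
    (ties broken alphabetically). A simple assumed heuristic."""
    start = store.docs[incident_id]["component"]
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(store.depends_on.get(node, ()))
    ranked = sorted(seen, key=lambda c: (-error_counts.get(c, 0), c))
    return [c for c in ranked if error_counts.get(c, 0) > 0]


# Usage with made-up data.
events = [
    Event("web-01", "error"),
    Event("app-01", "error"),
    Event("db-01", "error"),
    Event("db-01", "error"),
    Event("cache-01", "warning"),
]
counts = reduce_phase(map_phase(events))

store = DocumentStore()
store.put("INC-1", {"component": "web-01", "summary": "checkout page timeouts"})
store.relate("web-01", "app-01")
store.relate("app-01", "db-01")
store.relate("app-01", "cache-01")

suspects = suspect_causes(store, "INC-1", counts)
print(suspects)  # ['db-01', 'app-01', 'web-01']
```

In a real deployment the map and reduce functions would run distributed over many machines and the relationships would live in a database rather than a Python dictionary; keeping them as plain functions here only shows how aggregation, relationship storage, and ranking fit together.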