Near-Miss Analysis and the Availability of Software Systems

Software failures often render systems unavailable, with consequences ranging from financial loss to loss of life. Preventing their recurrence is therefore essential. To this end, a post-mortem investigation of a software failure is usually conducted to identify its root cause. However, these investigations often lack efficiency and accuracy because they depend on the investigator's expertise and knowledge of the system, and are therefore subjective. Furthermore, investigating a software failure can be challenging because of the usually high volume of failure data, such as log entries, that must be scrutinised. To address this problem, near-miss analysis is proposed. Near-miss analysis is an incident investigation technique that detects indicators of a likely failure before the failure unfolds. As these indicators, known as near misses, are very close to the point of failure, they are most likely to point to its root cause. Near-miss analysis therefore offers an objective approach to root-cause analysis based on the data collected from the near misses. The near-miss analysis method proposed in this paper is based on pattern analysis of a software system’s behaviour close to a failure in order to identify near misses. The viability of the proposed method is demonstrated through an experiment.
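
To make the notion of a near miss concrete, the following is a minimal illustrative sketch and not the method proposed in this paper, whose pattern analysis is not detailed in the abstract. It assumes a single monitored indicator (for example, memory use in percent) and treats any observation that comes within a chosen margin of a known failure level, without reaching it, as a near miss. The names `Sample`, `find_near_misses`, `failure_level` and `margin` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    t: int         # time of observation (arbitrary units)
    value: float   # monitored indicator, e.g. memory use in percent (assumed)


def find_near_misses(samples: List[Sample],
                     failure_level: float,
                     margin: float) -> List[Sample]:
    """Return samples within `margin` of `failure_level` that have not reached it.

    This is a simple threshold rule chosen for illustration only; it stands in
    for whatever pattern analysis an actual near-miss detector would use.
    """
    lower = failure_level - margin
    return [s for s in samples if lower <= s.value < failure_level]


if __name__ == "__main__":
    # Hypothetical observations leading up to a failure at value 100.
    history = [Sample(0, 40.0), Sample(1, 85.0), Sample(2, 97.0), Sample(3, 100.0)]
    for s in find_near_misses(history, failure_level=100.0, margin=10.0):
        print(f"near miss at t={s.t}: value={s.value}")
```

Under these assumptions, only the observation immediately preceding the failure is flagged, which reflects the idea that data collected close to the point of failure is the most useful for root-cause analysis.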