Online Failure Forecast for Fault-Tolerant Data Stream Processing

In this paper, we present a new online failure forecast system that enables predictive failure management for fault-tolerant data stream processing. Unlike previous reactive or proactive approaches, predictive failure management uses failure forecasts to take informed, just-in-time preventive actions on abnormal components only. We employ stream-based online learning methods to continuously classify the runtime state of each operator as normal, alert, or failure, based on collected feature streams. We have implemented the online failure forecast system as part of the IBM System S stream processing system. Our experiments show that the online failure forecast system can achieve good prediction accuracy for a range of stream processing software failures, while imposing low overhead on the stream processing system.
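To make the classification step concrete, the following is a minimal sketch of an online three-state classifier, not the system's actual learner. It assumes a nearest-centroid model updated one labeled sample at a time; the class name `OnlineCentroidClassifier`, the two example features (CPU utilization and input-queue fill fraction), and all sample values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# The three operator states named in the abstract.
STATES = ("normal", "alert", "failure")


class OnlineCentroidClassifier:
    """Illustrative nearest-centroid classifier updated incrementally.

    Assumption: this stands in for the paper's stream-based online
    learning method, which the abstract does not specify.
    """

    def __init__(self):
        self.centroids = {}             # state -> running mean feature vector
        self.counts = defaultdict(int)  # state -> number of samples seen

    def update(self, features, state):
        """Fold one labeled feature vector into the running mean of its state."""
        if state not in STATES:
            raise ValueError(f"unknown state: {state!r}")
        self.counts[state] += 1
        n = self.counts[state]
        if state not in self.centroids:
            self.centroids[state] = list(features)
            return
        centroid = self.centroids[state]
        for i, x in enumerate(features):
            # Incremental mean: m_n = m_{n-1} + (x - m_{n-1}) / n
            centroid[i] += (x - centroid[i]) / n

    def predict(self, features):
        """Return the state whose centroid is closest to the feature vector."""
        if not self.centroids:
            return "normal"  # assumed default before any training data arrives

        def squared_distance(state):
            return sum((x - c) ** 2
                       for x, c in zip(features, self.centroids[state]))

        return min(self.centroids, key=squared_distance)


# Hypothetical labeled samples: (cpu_utilization, queue_fill_fraction).
clf = OnlineCentroidClassifier()
clf.update((0.10, 0.20), "normal")
clf.update((0.30, 0.20), "normal")
clf.update((0.60, 0.70), "alert")
clf.update((0.95, 0.99), "failure")

state = clf.predict((0.90, 0.95))
```

Because each update touches only one running mean, the per-sample cost is linear in the number of features, which is the kind of property a low-overhead online learner needs; a real deployment would likely use a richer model and features.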