Combining supervised and unsupervised monitoring for fault detection in distributed computing systems

Fast and accurate fault detection is becoming an essential component of management software for mission-critical systems. A good fault detector makes it possible to initiate repair actions quickly, increasing the availability of the system. The contribution of this paper is twofold. First, we propose a new concept of supervised and unsupervised monitoring for system fault detection. We use a statistical method, canonical correlation analysis (CCA), to model the contextual dependencies between system inputs u and internal behavior x. By means of CCA, the space of x is transformed into two subsets of variables, which are monitored in a supervised and an unsupervised manner, respectively. By doing so, our approach can reduce the false alarms resulting from unusual workload changes, and hence achieve a high fault detection rate. Second, in order to test the performance of our approach, we simulate a variety of system faults in a real e-commerce application based on the multi-tiered J2EE architecture. Experimental results demonstrate that the CCA-based approach can detect injected failures at an early stage, when their symptoms are still very weak, and can hence contribute to enormous time and cost savings in managing large-scale distributed systems.
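
To make the idea of splitting x into two monitored subsets concrete, the following is a minimal sketch under strong simplifying assumptions, not the implementation evaluated in this paper. It assumes a scalar input u (for example, a request rate) and two internal metrics x = (x1, x2). With a scalar input, the first canonical direction for x is proportional to Sxx^{-1} s_xu, and the remaining direction is uncorrelated with u. The sketch interprets "supervised" monitoring as checking the input-correlated variate against its value predicted from u, and "unsupervised" monitoring as checking the input-uncorrelated variate against its own fault-free spread; this interpretation, the function names `fit_monitor`, `scores`, and `is_fault`, the threshold of 4 standard deviations, and the synthetic data are all assumptions introduced for illustration.

```python
import random
import statistics


def fit_monitor(u, x):
    """Fit on fault-free data: u is a list of scalar inputs, x a list of (x1, x2) metrics."""
    n = len(u)
    mu_u = statistics.fmean(u)
    mu_x = (statistics.fmean([r[0] for r in x]), statistics.fmean([r[1] for r in x]))
    uc = [v - mu_u for v in u]
    x1 = [r[0] - mu_x[0] for r in x]
    x2 = [r[1] - mu_x[1] for r in x]

    def cov(p, q):
        # Sample covariance of two already-centered sequences.
        return sum(pi * qi for pi, qi in zip(p, q)) / (n - 1)

    s11, s12, s22 = cov(x1, x1), cov(x1, x2), cov(x2, x2)
    c1, c2 = cov(uc, x1), cov(uc, x2)
    det = s11 * s22 - s12 * s12
    # First canonical direction (scalar-input case): a is proportional to Sxx^{-1} s_xu.
    a = ((s22 * c1 - s12 * c2) / det, (s11 * c2 - s12 * c1) / det)
    # Remaining direction: orthogonal to s_xu, so b.x is uncorrelated with u and with a.x.
    b = (-c2, c1)
    z1 = [a[0] * p + a[1] * q for p, q in zip(x1, x2)]
    z2 = [b[0] * p + b[1] * q for p, q in zip(x1, x2)]
    slope = cov(uc, z1) / cov(uc, uc)  # least-squares prediction of z1 from u
    resid = [z - slope * v for z, v in zip(z1, uc)]
    return {"mu_u": mu_u, "mu_x": mu_x, "a": a, "b": b, "slope": slope,
            "resid_sd": statistics.pstdev(resid), "z2_sd": statistics.pstdev(z2)}


def scores(m, u, x):
    """Return (supervised, unsupervised) deviations in fault-free standard deviations."""
    v = u - m["mu_u"]
    p, q = x[0] - m["mu_x"][0], x[1] - m["mu_x"][1]
    z1 = m["a"][0] * p + m["a"][1] * q
    z2 = m["b"][0] * p + m["b"][1] * q
    return abs(z1 - m["slope"] * v) / m["resid_sd"], abs(z2) / m["z2_sd"]


def is_fault(m, u, x, threshold=4.0):
    # Hypothetical alarm rule: either score exceeds the threshold.
    return max(scores(m, u, x)) > threshold


# Synthetic fault-free training data (illustrative only): both metrics follow the workload.
rng = random.Random(0)
train_u = [rng.uniform(10, 100) for _ in range(500)]
train_x = [(2 * v + rng.gauss(0, 1), 0.5 * v + rng.gauss(0, 1)) for v in train_u]
model = fit_monitor(train_u, train_x)

# A workload surge far outside the training range, with metrics consistent with it.
surge_alarm = is_fault(model, 300.0, (600.0, 150.0))
# A normal workload, but x2 drifts upward (a weak fault symptom).
drift_alarm = is_fault(model, 50.0, (100.0, 40.0))
print(surge_alarm, drift_alarm)
```

In this toy setting, the surge sample lies far outside the training range of x yet remains consistent with its input, so a detector that conditions on u has no reason to flag it, whereas the drifting sample stays within the normal range of each raw metric but breaks the learned input-behavior relationship. This is meant only to illustrate the intuition behind the reduced false alarms and early detection described above; handling multi-dimensional u and x would require a full CCA solver.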