Issues in applying data mining to grid job failure detection and diagnosis

As grid computation systems become larger and more complex, manually diagnosing failures in jobs becomes impractical. Recently, machine-learning techniques have been proposed to detect a variety of application failures in grids. While this is a promising approach, there are many options as to how to apply machine learning to this problem, and it is not always obvious which approaches are feasible or effective. We explore some issues that arise when we try to apply existing implementations of data mining algorithms to diagnose as well as predict job failures in grids. We demonstrate that a) it is feasible to gather enough data in real time to train useful classifier algorithms, using only a small fraction of the grid's computational resources, b) it is important to choose the features used for classification with care, and c) it is useful to have both per-user and system-wide classifiers, as they diagnose different kinds of problems. We illustrate all these issues using a prototype system that runs over the Condor grid computation platform [3].