Issues in applying data mining to grid job failure detection and diagnosis
As grid computation systems become larger and more complex, manually diagnosing failures in jobs becomes impractical. Recently, machine-learning techniques have been proposed to detect a variety of application failures in grids. While this is a promising approach, there are many options as to how to apply machine learning to this problem, and it is not always obvious which approaches are feasible or effective. We explore some issues that arise when we try to apply existing implementations of data mining algorithms to diagnose as well as predict job failures in grids. We demonstrate that a) it is feasible to gather enough data in real time to train useful classifier algorithms, using only a small fraction of the grid's computational resources, b) it is important to choose the features used for classification with care, and c) it is useful to have both per-user and system-wide classifiers, as they diagnose different kinds of problems. We illustrate all these issues using a prototype system that runs over the Condor grid computation platform [3].
[1] Thomas Fahringer, et al. Grid Application Fault Diagnosis Using Wrapper Services and Machine Learning, 2007, Int. J. Cooperative Inf. Syst.
[2] David A. Cieslak, et al. Short Paper: Troubleshooting Distributed Systems via Data Mining, 2006, 2006 15th IEEE International Conference on High Performance Distributed Computing.
[3] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques with Java implementations, 2002, SGMD.
[4] 金田 重郎, et al. C4.5: Programs for Machine Learning (book review), 1995.