Toward autonomic grids: analyzing the job flow with affinity streaming

The Affinity Propagation (AP) clustering algorithm proposed by Frey and Dueck (2007) provides an understandable, nearly optimal summary of a dataset, albeit with quadratic computational complexity. This paper, motivated by Autonomic Computing, extends AP to the data streaming framework. First, a hierarchical strategy is used to reduce the complexity to O(N^{1+ε}); the distortion loss incurred is analyzed in relation to the dimension of the data items. Second, AP is coupled with a change detection test to cope with non-stationary data distributions and to rebuild the model as needed. The resulting approach, StrAP, is applied to the stream of jobs submitted to the EGEE Grid, providing an understandable description of the job flow and enabling the system administrator to spot some sources of failure online.
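As an illustration of the change detection step, the sketch below implements a Page-Hinkley style cumulative sum test in plain Python. The abstract does not name the test it uses; Page-Hinkley is assumed here only because the reference list includes Page (1954) and Hinkley (1971). The class name `PageHinkley`, the helper `first_alarm`, the parameter values, and the choice of monitored signal (for example, a 0/1 flag marking items that do not fit the current model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""
    delta: float = 0.05      # tolerated drift magnitude (assumed value)
    threshold: float = 5.0   # alarm threshold (assumed value)
    n: int = 0               # number of observations seen so far
    mean: float = 0.0        # running mean of the observations
    cum: float = 0.0         # cumulative deviation from the running mean
    cum_min: float = 0.0     # smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        """Consume one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_alarm(stream: Iterable[float]) -> Optional[int]:
    """Return the 0-based index of the first alarm, or None if there is none."""
    detector = PageHinkley()
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


if __name__ == "__main__":
    # Hypothetical signal: 0 while items fit the model, 1 once they stop fitting.
    signal = [0.0] * 50 + [1.0] * 50
    print("first alarm at index:", first_alarm(signal))
```

In a streaming setting, an alarm of this kind would be the trigger for rebuilding the clustering model from recent data, after which the detector state would be reset.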

[1] J. Andreeva, et al. Dashboard for the LHC experiments, 2008.

[2] Carl E. Rasmussen, et al. Gaussian processes for machine learning, 2005, Adaptive computation and machine learning.

[3] Michèle Sebag, et al. Data Streaming with Affinity Propagation, 2008, ECML/PKDD.

[4] Zaïd Harchaoui, et al. Kernel Change-point Analysis, 2008, NIPS.

[5] Eric Walter, et al. Global optimization of expensive-to-evaluate functions: an empirical comparison of two sampling criteria, 2009, J. Glob. Optim.

[6] E. S. Page. Continuous Inspection Schemes, 1954.

[7] Delbert Dueck, et al. Clustering by Passing Messages Between Data Points, 2007, Science.

[8] Sheng Ma, et al. Adaptive diagnosis in distributed systems, 2005, IEEE Transactions on Neural Networks.

[9] Ran Wolff, et al. Mining for misconfigured machines in grid systems, 2006, KDD '06.

[10] Hans-Peter Kriegel, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996, KDD.

[11] Anil K. Jain, et al. Large-Scale Parallel Data Clustering, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[12] João Gama, et al. Accurate decision trees for mining high-speed data streams, 2003, KDD '03.

[13] G. Schwarz. Estimating the Dimension of a Model, 1978.

[14] Graham Cormode, et al. Conquering the Divide: Continuous Clustering of Distributed Data Streams, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[15] Aoying Zhou, et al. Density-Based Clustering over an Evolving Data Stream with Noise, 2006, SDM.

[16] S. Mallat, et al. Matching pursuit of images, 1994, Proceedings of IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis.

[17] Marina Meila, et al. The uniqueness of a good optimum for K-means, 2006, ICML.

[18] Sudipto Guha, et al. Clustering Data Streams: Theory and Practice, 2003, IEEE Trans. Knowl. Data Eng.

[19] S. Muthukrishnan, et al. Sequential Change Detection on Data Streams, 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[20] Padhraic Smyth, et al. From Data Mining to Knowledge Discovery: An Overview, 1996, Advances in Knowledge Discovery and Data Mining.

[21] Michele Leone, et al. Clustering by Soft-constraint Affinity Propagation: Applications to Gene-expression Data, 2007.

[22] João Gama, et al. Stream-Based Electricity Load Forecast, 2007, PKDD.

[23] D. Hinkley. Inference about the change-point from cumulative sum tests, 1971.