Online Model-Based Clustering for Crisis Identification in Distributed Computing

Large-scale distributed computing systems can suffer occasional severe violations of their performance goals. Because these systems are complex, manual diagnosis of the cause of a crisis is too slow to inform interventions made while the crisis is under way. Rapid automatic recognition that a problem is recurring can lead to diagnosis of its cause and to informed intervention. We frame this as an online clustering problem in which the labels (causes) of some previous crises may be known. We give a fast and accurate solution using model-based clustering based on a Dirichlet process mixture; the evolution of each crisis is modeled as a multivariate time series. In the periods between crises we perform full Bayesian inference for the past crises, and when a new crisis occurs we apply fast approximate Bayesian updating. These inferences allow real-time, expected-cost-minimizing decision making that fully accounts for uncertainty in the crisis labels and other parameters. We apply and validate our methods using simulated data and data from a production computing center with hundreds of servers running a 24/7 email-related application.
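
To make the decision step concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the standard Chinese restaurant process predictive rule for a Dirichlet process mixture: an existing cluster k has prior weight proportional to its size n_k, and a new cluster has prior weight proportional to the concentration parameter alpha. The function names, the per-cluster log-likelihood inputs, and the cost matrix are all hypothetical; in practice the log-likelihoods would come from whatever time-series model is fitted to each crisis type.

```python
import math


def label_posterior(cluster_sizes, log_liks, log_lik_new, alpha):
    """Posterior over the label of a new crisis (hypothetical helper).

    cluster_sizes[k] : number of past crises assigned to cluster k
    log_liks[k]      : log-likelihood of the new crisis data under cluster k
    log_lik_new      : log marginal likelihood under a brand-new cluster
    alpha            : DP concentration parameter (assumed known here)

    Returns a list of probabilities; the last entry is "previously unseen type".
    """
    # CRP prior weight times likelihood, on the log scale. The shared
    # normaliser (n + alpha) cancels, so it is omitted.
    log_w = [math.log(n) + ll for n, ll in zip(cluster_sizes, log_liks)]
    log_w.append(math.log(alpha) + log_lik_new)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]


def best_action(posterior, cost):
    """Pick the action with the smallest posterior expected cost.

    cost[a][k] is the (assumed) cost of taking action a when the true
    label is k. Returns (index of best action, list of expected costs).
    """
    expected = [sum(p * c for p, c in zip(posterior, row)) for row in cost]
    best = min(range(len(cost)), key=expected.__getitem__)
    return best, expected


# Two known crisis types seen 3 and 1 times; the data are uninformative here.
posterior = label_posterior([3, 1], [0.0, 0.0], 0.0, alpha=1.0)
# Actions: fix for type 0, fix for type 1, generic low-cost mitigation.
cost = [[0, 10, 10], [10, 0, 10], [1, 1, 1]]
action, expected = best_action(posterior, cost)
```

Because the expected cost averages over the full label posterior, including the probability of a previously unseen crisis type, an uncertain posterior can favour a cautious generic action over a targeted fix for the most probable type.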
