Hound: Causal Learning for Datacenter-scale Straggler Diagnosis

Stragglers are exceptionally slow tasks within a job that delay its completion. Although stragglers are uncommon within a single job, they are pervasive in datacenters that run many jobs. A large body of research has focused on mitigating datacenter stragglers, but relatively little has focused on systematically and rigorously identifying their root causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, interpretable models, unbiased inference, and computational efficiency. We demonstrate Hound’s capabilities on a production trace from Google’s warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.
