Paper information - Hound: Causal Learning for Datacenter-scale Straggler Diagnosis

Hound: Causal Learning for Datacenter-scale Straggler Diagnosis

Stragglers are exceptionally slow tasks within a job that delay its completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. A large body of research has focused on mitigating datacenter stragglers, but relatively little research has focused on systematically and rigorously identifying their root causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, interpretable models, unbiased inference, and computational efficiency. We demonstrate Hound’s capabilities for a production trace from Google’s warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.

Benjamin C. Lee | Pengfei Zheng

[1] J. Lunceford,et al. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study , 2004, Statistics in medicine.

[2] Yoav Freund,et al. A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[3] Gregory R. Ganger,et al. Diagnosing Performance Changes by Comparing Request Flows , 2011, NSDI.

[4] Gregory R. Ganger,et al. Ironmodel: robust performance models in the wild , 2008, SIGMETRICS '08.

[5] Sanjay Ghemawat,et al. The Google file system , 2003 .

[6] Yu Luo,et al. lprof: A Non-intrusive Request Flow Profiler for Distributed Systems , 2014, OSDI.

[7] Harald Steck,et al. Learning the Bayesian Network Structure: Dirichlet Prior versus Data , 2008, UAI 2008.

[8] B. Schweizer,et al. On Nonparametric Measures of Dependence for Random Variables , 1981 .

[9] Srinivasan Seshan,et al. Developing a predictive model of quality of experience for internet video , 2013, SIGCOMM.

[10] Gregory F. Cooper,et al. The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks , 1990, Artif. Intell.

[11] Leo Breiman,et al. Bagging Predictors , 1996, Machine Learning.

[12] Qi Zhao,et al. Towards automated performance diagnosis in a large IPTV network , 2009, SIGCOMM '09.

[13] Jeffrey S. Chase,et al. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.

[14] Bill Ravens,et al. An Introduction to Copulas , 2000, Technometrics.

[15] David H. Wolpert,et al. No free lunch theorems for optimization , 1997, IEEE Trans. Evol. Comput.

[16] Joseph K. Bradley,et al. Spark SQL: Relational Data Processing in Spark , 2015, SIGMOD Conference.

[17] Magdalena Balazinska,et al. Skew-resistant parallel processing of feature-extracting scientific user-defined functions , 2010, SoCC '10.

[18] Armando Fox,et al. HiLighter: Automatically Building Robust Signatures of Performance Behavior for Small- and Large-Scale Systems , 2008, SysML.

[19] Armando Fox,et al. Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[20] Joseph L. Hellerstein,et al. Obfuscatory obscanturism: Making workload traces of commercially-sensitive systems safe to release , 2012, 2012 IEEE Network Operations and Management Symposium.

[21] Luiz André Barroso,et al. The tail at scale , 2013, CACM.

[22] Sang Joon Kim,et al. A Mathematical Theory of Communication , 2006 .

[23] Eric A. Brewer,et al. Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[24] Stefan Szeider,et al. Algorithms and Complexity Results for Exact Bayesian Structure Learning , 2010, UAI.

[25] Armando Fox,et al. Ensembles of models for automated diagnosis of system performance problems , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[26] Scott Shenker,et al. Making Sense of Performance in Data Analytics Frameworks , 2015, NSDI.

[27] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[28] Magdalena Balazinska,et al. SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[29] Randy H. Katz,et al. Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[30] Damaris Zurell,et al. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance , 2013 .

[31] Lingjia Tang,et al. Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference , 2016, ISCA.

[32] Elias Bareinboim,et al. Controlling Selection Bias in Causal Inference , 2011, AISTATS.

[33] Armando Fox,et al. Fingerprinting the datacenter: automated classification of performance crises , 2010, EuroSys '10.

[34] Randy H. Katz,et al. Wrangler: Predictable and Faster Jobs using Fewer Resources , 2014, SoCC.

[35] Raghunath Othayoth Nambiar,et al. The making of TPC-DS , 2006, VLDB.

[36] H. Zou,et al. Regularization and variable selection via the elastic net , 2005 .

[37] Randy H. Katz,et al. Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling , 2016, J. Mach. Learn. Res.

[38] Donald Beaver,et al. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .

[39] Adam Wierman,et al. Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale , 2015, SIGCOMM.

[40] Yuqing Zhu,et al. BigDataBench: A big data benchmark suite from internet services , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[41] R. Tibshirani. Regression Shrinkage and Selection via the Lasso , 1996 .

[42] Lance M. Berc,et al. Continuous profiling: where have all the cycles gone? , 1997, ACM Trans. Comput. Syst.

[43] Rodrigo Fonseca,et al. Pivot tracing , 2018, USENIX ATC.

[44] Barnabás Póczos,et al. Copula-based Kernel Dependency Measures , 2012, ICML.

[45] Randy H. Katz,et al. X-Trace: A Pervasive Network Tracing Framework , 2007, NSDI.

[46] Albert G. Greenberg,et al. Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.

[47] Michael I. Jordan,et al. Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[48] Thomas F. Wenisch,et al. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services , 2014, OSDI.

[49] Scott Shenker,et al. Effective Straggler Mitigation: Attack of the Clones , 2013, NSDI.

[50] Mirco Nanni,et al. Speeding-Up Hierarchical Agglomerative Clustering in Presence of Expensive Metrics , 2005, PAKDD.

[51] Salvatore J. Stolfo,et al. Experiments on multistrategy learning by meta-learning , 1993, CIKM '93.

[52] Jesús Muñoz,et al. Comparison of statistical methods commonly used in predictive modelling , 2004 .