A Holistic Analysis of Datacenter Operations: Resource Usage, Energy, and Workload Characterization - Extended Technical Report

Improving datacenter operations is vital for the digital society. We posit that doing so requires our community to shift from analyzing operational aspects in isolation to holistic analysis of datacenter resources, energy, and workloads. In turn, this shift will require new analysis methods and open-access, FAIR datasets with fine temporal and spatial granularity. In this work, we leverage one of the rare public datasets that provide fine-grained information on datacenter operations. Using it, we show strong evidence that fine-grained information reveals new operational aspects. We then propose a method for holistic analysis of datacenter operations, providing statistical characterization of node, energy, and workload aspects. We demonstrate the benefits of our holistic analysis method by applying it to the operations of a datacenter infrastructure with over 300 nodes. Our analysis reveals both generic and ML-specific aspects, and further details how the operational behavior of the datacenter changed during the 2020 COVID-19 pandemic. We make over 30 main observations, providing holistic insight into the long-term operation of a large-scale, public scientific infrastructure. We suggest that such observations can help immediately with performance engineering tasks, such as predicting future datacenter load, and in the long term with the design of datacenter infrastructure.
