GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow and PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelining, distributed training). Matching this trend, considerable effort has also been put into performance analysis tools focused on ML model training. However, as we identify in this work, ML model training rarely happens in isolation; instead, it is one step in a larger ML workflow. It is therefore surprising that no performance analysis tool covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the capabilities of existing performance analysis tools. We formulate our vision for the holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient aggregation and presentation of these data, and close integration with ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.
