GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows
Alexandru Iosup | Tim Hegeman | Animesh Trivedi | Matthijs Jansen
[1] Bram van Ginneken,et al. A survey on deep learning in medical image analysis , 2017, Medical Image Anal..
[2] Wencong Xiao,et al. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads , 2019, USENIX Annual Technical Conference.
[3] Xin Zhang,et al. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform , 2017, KDD.
[4] Haoxiang Lin,et al. G2: A Graph Processing System for Diagnosing Distributed Systems , 2011, USENIX Annual Technical Conference.
[5] Gu-Yeon Wei,et al. Benchmarking TPU, GPU, and CPU Platforms for Deep Learning , 2019, ArXiv.
[6] Weiguo Liu,et al. End-to-end I/O Monitoring on Leading Supercomputers , 2022, NSDI.
[7] Vanish Talwar,et al. VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications , 2012, Middleware.
[8] Xin Wang,et al. Clipper: A Low-Latency Online Prediction Serving System , 2016, NSDI.
[9] Nikhil R. Devanur,et al. PipeDream: Fast and Efficient Pipeline Parallel DNN Training , 2018, ArXiv.
[10] Junsung Lim,et al. MLOp Lifecycle Scheme for Vision-based Inspection Process in Manufacturing , 2019, OpML.
[11] James Demmel,et al. ImageNet Training in Minutes , 2017, ICPP.
[12] Cees T. A. M. de Laat,et al. A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term , 2016, Computer.
[13] Yolanda Gil,et al. A 20-Year Community Roadmap for Artificial Intelligence Research in the US , 2019, ArXiv.
[14] Valeriu Codreanu,et al. DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning , 2020, 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS).
[15] Rajeev Gandhi,et al. Theia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters , 2012, LISA.
[16] Minsub Kim,et al. Reducing tail latency of DNN-based recommender systems using in-storage processing , 2020, APSys.
[17] Rodrigo Fonseca,et al. Retro: Targeted Resource Management in Multi-tenant Distributed Systems , 2015, NSDI.
[18] Chen Wang,et al. FfDL: A Flexible Multi-tenant Deep Learning Platform , 2019, Middleware.
[19] Yifan Wang,et al. Challenges and Experiences with MLOps for Performance Diagnostics in Hybrid-Cloud Enterprise Software Deployments , 2020, OpML.
[20] Lars Kotthoff,et al. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA , 2017, J. Mach. Learn. Res..
[21] Ihab F. Ilyas,et al. Data Cleaning: Overview and Emerging Challenges , 2016, SIGMOD Conference.
[22] Harald C. Gall,et al. Software Engineering for Machine Learning: A Case Study , 2019, 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
[23] Wei Wang,et al. Towards Framework-Independent, Non-Intrusive Performance Characterization for Dataflow Computation , 2019, APSys '19.
[24] Mingfa Zhu,et al. MIMP: Deadline and Interference Aware Scheduling of Hadoop Virtual Machines , 2014, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[25] David Patterson,et al. MLPerf Training Benchmark , 2019, MLSys.
[26] Amar Phanishayee,et al. Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training , 2020, USENIX Annual Technical Conference.
[27] Guo Li,et al. KungFu: Making Training in Distributed Machine Learning Adaptive , 2020, OSDI.
[28] Tian Li,et al. Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads , 2017, Proc. VLDB Endow..
[29] Ali Ghodsi,et al. Accelerating the Machine Learning Lifecycle with MLflow , 2018, IEEE Data Eng. Bull..
[30] John K. Ousterhout,et al. NanoLog: A Nanosecond Scale Logging System , 2018, USENIX Annual Technical Conference.
[31] Jinjun Xiong,et al. Across-Stack Profiling and Characterization of Machine Learning Models on GPUs , 2019, ArXiv.
[32] Donald Beaver,et al. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .
[33] Scott Shenker,et al. Making Sense of Performance in Data Analytics Frameworks , 2015, NSDI.
[34] Markus Weimer,et al. Building Continuous Integration Services for Machine Learning , 2020, KDD.
[35] Alexander Sergeev,et al. Horovod: fast and easy distributed deep learning in TensorFlow , 2018, ArXiv.
[36] Barton P. Miller,et al. Diagnosing Distributed Systems with Self-propelled Instrumentation , 2008, Middleware.
[37] Matthias S. Müller,et al. The Vampir Performance Analysis Tool-Set , 2008, Parallel Tools Workshop.
[38] Ruben Mayer,et al. Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools , 2019 .
[39] James Y. Zou,et al. Data Shapley: Equitable Valuation of Data for Machine Learning , 2019, ICML.
[40] Quoc V. Le,et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism , 2018, ArXiv.
[41] A. Stephen McGough,et al. Predicting the Computational Cost of Deep Learning Models , 2018, 2018 IEEE International Conference on Big Data (Big Data).
[42] Wanling Gao,et al. HPC AI500: The Methodology, Tools, Roofline Performance Models, and Metrics for Benchmarking HPC AI Systems , 2020, ArXiv.
[43] Tim Kraska,et al. Northstar: An Interactive Data Science System , 2018, Proc. VLDB Endow..
[44] Torsten Hoefler,et al. A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[45] Thomas W. Tucker,et al. The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[46] Ranga Raju Vatsavai,et al. Multimodal Deep Learning Based Crop Classification Using Multispectral and Multitemporal Satellite Imagery , 2020, KDD.
[47] Neoklis Polyzotis,et al. Continuous Training for Production ML in the TensorFlow Extended (TFX) Platform , 2019, OpML.
[48] Allen D. Malony,et al. The Tau Parallel Performance System , 2006, Int. J. High Perform. Comput. Appl..
[49] Luping Wang,et al. Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale , 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[50] Sameh Elnikety,et al. Model-Switching: Dealing with Fluctuating Workloads in Machine-Learning-as-a-Service Systems , 2020, HotCloud.
[51] Alexandru Iosup,et al. Grade10: A Framework for Performance Characterization of Distributed Graph Processing , 2020, 2020 IEEE International Conference on Cluster Computing (CLUSTER).
[52] APSys '20: 11th ACM SIGOPS Asia-Pacific Workshop on Systems, Tsukuba, Japan, August 24-25, 2020 , 2020, APSys.
[53] Yu Luo,et al. Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold , 2017, SOSP.
[54] Said A. Salloum,et al. A survey of text mining in social media: Facebook and Twitter perspectives , 2017 .
[55] Xiaobo Zhou,et al. Profiling distributed systems in lightweight virtualized environments with logs and resource metrics , 2018, HPDC.
[56] Suman Jana,et al. DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars , 2017, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).
[57] Newsha Ardalani,et al. Beyond human-level accuracy: computational challenges in deep learning , 2019, PPoPP.
[58] Kunle Olukotun,et al. DAWNBench: An End-to-End Deep Learning Benchmark and Competition , 2017 .
[59] Neoklis Polyzotis,et al. Data Validation for Machine Learning , 2019, SysML.
[60] Jie Huang,et al. HiTune: Dataflow-Based Performance Analysis for Big Data Cloud , 2011, USENIX Annual Technical Conference.
[61] Rodrigo Fonseca,et al. Pivot tracing , 2018, USENIX ATC.
[62] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[63] Yu Luo,et al. Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle , 2016, OSDI.