Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines

Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, implementing efficient input pipelines is challenging, because it requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over 2 million ML jobs in Google datacenters reveals that a significant fraction of model training jobs could benefit from faster input data pipelines. At the same time, the analysis shows that most jobs do not saturate host hardware, which points to software-based bottlenecks. Motivated by these findings, we propose Plumber, a tool for finding bottlenecks in ML input pipelines. Plumber uses an extensible and interpretable analytical model, based on operational analysis, to automatically tune parallelism, prefetching, and caching under host resource constraints. Across five representative ML pipelines, Plumber obtains speedups of up to 46× for misconfigured pipelines. By automating caching, Plumber obtains end-to-end speedups of over 40% compared to state-of-the-art tuners.
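To make the idea of operational-analysis-driven tuning concrete, the following is a minimal sketch and not Plumber's actual implementation. It assumes each pipeline stage has a measured CPU cost per element, estimates a stage's throughput capacity as parallelism divided by that cost, and greedily gives spare cores to whichever stage currently has the lowest capacity. The `Stage` class, the function names, the stage names, and the cost figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One pipeline stage (hypothetical model, not Plumber's data structure)."""
    name: str
    cpu_seconds_per_element: float  # assumed to come from profiling
    parallelism: int = 1

    def capacity(self) -> float:
        # Elements per second this stage can sustain at its current parallelism.
        return self.parallelism / self.cpu_seconds_per_element


def bottleneck(stages: List[Stage]) -> Stage:
    # The stage with the lowest capacity bounds end-to-end throughput.
    return min(stages, key=lambda s: s.capacity())


def allocate_cores(stages: List[Stage], total_cores: int) -> List[Stage]:
    # Greedy heuristic: hand each spare core to the current bottleneck.
    spare = total_cores - sum(s.parallelism for s in stages)
    while spare > 0:
        bottleneck(stages).parallelism += 1
        spare -= 1
    return stages


# Illustrative numbers only; they are not taken from the paper.
pipeline = [
    Stage("read", 0.001),
    Stage("decode", 0.008),
    Stage("augment", 0.004),
]
allocate_cores(pipeline, total_cores=8)
for stage in pipeline:
    print(stage.name, stage.parallelism, round(stage.capacity(), 1))
```

Under these assumptions most of the spare cores go to the expensive decode stage, and the loop stops once the host's core budget is exhausted. A real tuner would also need to account for memory, I/O bandwidth, prefetch buffers, and caching, which this sketch leaves out.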
