Demystifying Developers' Issues in Distributed Training of Deep Learning Software

Deep learning (DL) has become pervasive in a wide spectrum of today's software systems and applications. The rich features of these DL-based software applications (i.e., DL software) usually rely on powerful DL models. To train powerful DL models on large datasets efficiently, it has become common practice for developers to parallelize and distribute the computation and memory across multiple devices during training, a practice known as distributed training. However, existing efforts in the software engineering (SE) research community mainly focus on issues in the general process of training DL models; to the best of our knowledge, the issues that developers encounter in distributed training have never been well studied. Given the surging importance of distributed training in current DL software development practice, this paper fills this knowledge gap and presents the first comprehensive study of developers' issues in distributed training. To this end, we extract and analyze 1,054 real-world developers' issues in distributed training from Stack Overflow and GitHub, two commonly used data sources for studying software issues. We construct a fine-grained taxonomy of 30 categories of fault symptoms and summarize common fix patterns for different symptoms. Based on these results, we suggest actionable implications and research avenues that can potentially facilitate the future development of distributed training.
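
For readers less familiar with the setting the study targets, the sketch below illustrates one common form of distributed training, data parallelism, using PyTorch's DistributedDataParallel (DDP). It is a minimal illustration only: the linear model, synthetic dataset, and hyperparameters are placeholder assumptions, not artifacts from the study.

    # Minimal data-parallel training sketch with PyTorch DDP.
    # Launch with: torchrun --nproc_per_node=<N> train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model and synthetic dataset; replace with real ones.
        model = torch.nn.Linear(128, 10).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])

        dataset = TensorDataset(torch.randn(1024, 128),
                                torch.randint(0, 10, (1024,)))
        # DistributedSampler shards the data so each process sees a disjoint subset.
        sampler = DistributedSampler(dataset)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle shards each epoch
            for inputs, targets in loader:
                inputs = inputs.cuda(local_rank)
                targets = targets.cuda(local_rank)
                optimizer.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()  # DDP all-reduces gradients across processes here
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

In this setup, each process holds a full model replica and a disjoint shard of the data, and DDP averages gradients across processes during the backward pass so the replicas stay synchronized. The setup and synchronization steps shown here (process-group initialization, device placement, data sharding, gradient communication) are exactly the kind of machinery that distinguishes distributed training from ordinary single-device training.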
