Clairvoyant Prefetching for Distributed Machine Learning I/O

I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers. Optimal data ingestion pipelines differ between systems, and increasing efficiency requires a delicate balance between access to local storage, external filesystems, and remote workers; yet existing frameworks fail to utilize these resources efficiently. We observe that, given the seed that generates the random access pattern for training with SGD, we have clairvoyance and can predict exactly when a given sample will be accessed. We combine this insight with a theoretical analysis of access patterns in training and with performance modeling to produce HDMLP, a novel machine learning I/O middleware that tackles the I/O bottleneck. HDMLP provides an easy-to-use, flexible, and scalable solution that delivers better performance than state-of-the-art approaches while requiring very few changes to existing codebases and supporting a broad range of environments.
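
To illustrate the clairvoyance idea, the sketch below shows how a seeded pseudo-random shuffle makes the entire per-epoch access sequence known in advance, so a background prefetcher can stage each sample before the training loop requests it. This is a minimal illustration under assumed names and structure, not HDMLP's actual implementation or API.

import threading
from queue import Queue
import numpy as np

class ClairvoyantPrefetcher:
    """Illustrative sketch (assumed names, not HDMLP's API): because the
    shuffle seed fixes the full access order of every epoch, samples can
    be fetched ahead of the training loop and staged in a local buffer."""

    def __init__(self, dataset, seed, epochs, depth=32):
        self.dataset = dataset             # any indexable source (local disk, parallel FS, remote worker)
        rng = np.random.default_rng(seed)
        # Precompute the exact access pattern for all epochs ("clairvoyance").
        self.schedule = [rng.permutation(len(dataset)) for _ in range(epochs)]
        self.buffer = Queue(maxsize=depth)  # bounded staging buffer acting as a local cache
        threading.Thread(target=self._prefetch, daemon=True).start()

    def _prefetch(self):
        # Fetch samples in exactly the order the training loop will consume them.
        for epoch_order in self.schedule:
            for idx in epoch_order:
                self.buffer.put((idx, self.dataset[idx]))  # blocks while the buffer is full

    def __iter__(self):
        for epoch_order in self.schedule:
            for _ in epoch_order:
                yield self.buffer.get()     # sample is typically already staged locally

# Usage (hypothetical training loop): I/O overlaps with computation because
# the prefetch thread always knows which sample comes next.
# for idx, sample in ClairvoyantPrefetcher(dataset, seed=42, epochs=2):
#     train_step(sample)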
