Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Although recent scale-up approaches to training deep neural networks have proven effective, the computational intensity of large and complex models, together with the availability of large-scale datasets, requires deep learning frameworks to scale out as well. Parallelization and distribution were not considered in the original designs of most available deep learning frameworks, and most of them are still unable to perform effective and efficient fine-grained inter-node communication. We present Phylanx, which has the potential to alleviate these shortcomings. Phylanx offers a productivity-oriented frontend in which user Python code is translated into a futurized execution tree that can be executed efficiently across multiple nodes using HPX, the C++ standard library for parallelism and concurrency, leveraging fine-grained threading and an active-messaging, task-based runtime system.
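
To make the frontend idea concrete, the sketch below shows how NumPy-style user code might be handed to Phylanx for futurized, task-based execution. It is a minimal illustration only: the Phylanx decorator name, the import path, and the set of supported NumPy operations are assumptions based on the project's published examples and may differ between releases.

    from phylanx import Phylanx   # assumed import path; may differ by release
    import numpy as np

    # The decorator (assumed name) transforms the function body into a
    # Phylanx expression tree instead of executing it eagerly in Python.
    @Phylanx
    def dot_chain(a, b, c):
        # Each operation becomes a node in the futurized execution tree;
        # the HPX runtime schedules the resulting tasks across cores
        # (and, in the distributed case, across nodes).
        return np.dot(np.dot(a, b), c)

    a = np.random.rand(1024, 1024)
    b = np.random.rand(1024, 1024)
    c = np.random.rand(1024, 1024)

    # Calling the decorated function triggers evaluation of the tree by
    # the HPX runtime rather than by the CPython interpreter.
    result = dot_chain(a, b, c)

The point of this sketch is the separation of concerns the abstract describes: the user writes ordinary array-oriented Python, while scheduling, fine-grained threading, and inter-node communication are delegated to the underlying task-based runtime.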
