Towards a Scalable and Distributed Infrastructure for Deep Learning Applications
[1] Dhabaleswar K. Panda,et al. S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters , 2017, PPoPP.
[2] Patrick Diehl,et al. Asynchronous Execution of Python Code on Task-Based Runtime Systems , 2018, 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2).
[3] Nataliia Ihorivna Mulina,et al. Programming language C, 2013.
[4] Marc Snir,et al. Channel and filter parallelism for large-scale CNN training , 2019, SC.
[5] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[6] Michael J. Franklin,et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.
[7] Joel Nothman,et al. SciPy 1.0-Fundamental Algorithms for Scientific Computing in Python , 2019, ArXiv.
[8] Dirk Pflüger,et al. Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars , 2019, Int. J. High Perform. Comput. Appl..
[9] Patrick Diehl,et al. An asynchronous and task-based implementation of peridynamics utilizing HPX—the C++ standard library for parallelism and concurrency , 2018, SN Applied Sciences.
[10] Wes McKinney,et al. pandas: a Foundational Python Library for Data Analysis and Statistics , 2011 .
[11] Jorge Nocedal,et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.
[12] Alexander Aiken,et al. Beyond Data and Model Parallelism for Deep Neural Networks , 2018, SysML.
[13] Daniel Sunderland,et al. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns , 2014, J. Parallel Distributed Comput..
[14] Hai Jin,et al. Exploiting potential of deep neural networks by layer-wise fine-grained parallelism , 2020, Future Gener. Comput. Syst..
[15] Bradford L. Chamberlain,et al. Parallel Programmability and the Chapel Language , 2007, Int. J. High Perform. Comput. Appl..
[16] Marc'Aurelio Ranzato,et al. Large Scale Distributed Deep Networks , 2012, NIPS.
[17] Samy Bengio,et al. Revisiting Distributed Synchronous SGD , 2016, ArXiv.
[18] Gaël Varoquaux,et al. The NumPy Array: A Structure for Efficient Numerical Computation , 2011, Computing in Science & Engineering.
[19] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.
[20] Thomas Heller,et al. HPX – An open source C++ Standard Library for Parallelism and Concurrency , 2023, ArXiv.
[21] Jackson R. Mayo,et al. Implementing Software Resiliency in HPX for Extreme Scale Computing , 2020, ArXiv.
[22] Parsa Amini,et al. Assessing the Performance Impact of using an Active Global Address Space in HPX: A Case for AGAS , 2019, 2019 IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM).
[23] Yibo Zhu,et al. A generic communication scheduler for distributed DNN training acceleration , 2019, SOSP.
[24] Thomas Heller,et al. Application of the ParalleX execution model to stencil-based problems , 2012, Computer Science - Research and Development.
[25] Shen Li,et al. PyTorch distributed , 2020, Proc. VLDB Endow..
[26] Dylan Malone Stuart,et al. Memory Requirements for Convolutional Neural Network Hardware Accelerators , 2018, 2018 IEEE International Symposium on Workload Characterization (IISWC).
[27] Dietmar Fey,et al. Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers , 2013, ScalA '13.
[28] Alex Bigelow,et al. Visualizing a Moving Target: A Design Study on Task Parallel Programs in the Presence of Evolving Data and Concerns , 2019, IEEE Transactions on Visualization and Computer Graphics.
[29] Alexander Aiken,et al. Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[30] Bo Liu,et al. FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on Intel Stratix 10 , 2020, FPGA.
[31] Olatunji Ruwase,et al. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models , 2019, SC.
[32] Hossein Bobarshad,et al. HyperTune: Dynamic Hyperparameter Tuning for Efficient Distribution of DNN Training Over Heterogeneous Systems , 2020, 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
[33] Amit Agarwal,et al. CNTK: Microsoft's Open-Source Deep-Learning Toolkit , 2016, KDD.
[34] John D. Hunter,et al. Matplotlib: A 2D Graphics Environment , 2007, Computing in Science & Engineering.
[35] Di Kuang,et al. Comparative Study of Distributed Deep Learning Tools on Supercomputers , 2018, ICA3PP.
[36] S. Prudhomme,et al. Scheduling Optimization of Parallel Linear Algebra Algorithms Using Supervised Learning , 2019, 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC).
[37] Kostas Katrinis,et al. A taxonomy of task-based parallel programming technologies for high-performance computing , 2018, The Journal of Supercomputing.
[38] Fangfang Xia,et al. Performance, Power, and Scalability Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on the Cray XC40 Theta , 2018 .
[39] Dustin Tran,et al. Mesh-TensorFlow: Deep Learning for Supercomputers , 2018, NeurIPS.
[40] Christopher,et al. STEllAR-GROUP/hpx: HPX V1.1.0: The C++ Standards Library for Parallelism and Concurrency, 2018.
[41] Dietmar Fey,et al. Higher-level parallelization for local and distributed asynchronous task-based programming , 2015, ESPM '15.
[42] Sangeetha Abdu Jyothi,et al. Communication Scheduling as a First-Class Citizen in Distributed Machine Learning Systems , 2018, ArXiv.
[43] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[44] Yao Zhang,et al. BigDL: A Distributed Deep Learning Framework for Big Data , 2019, SoCC.
[45] Anthony K. H. Tung,et al. SINGA: A Distributed Deep Learning Platform , 2015, ACM Multimedia.
[46] Katherine E. Isaacs,et al. JetLag: An Interactive, Asynchronous Array Computing Environment , 2020, PEARC.
[47] J. Beauchamp,et al. International Organization for Standardization (ISO), 2015.
[48] Allen D. Malony,et al. Runtime Adaptive Task Inlining on Asynchronous Multitasking Runtime Systems , 2019, ICPP.
[49] Takuya Akiba,et al. Chainer: A Deep Learning Framework for Accelerating the Research Cycle , 2019, KDD.
[50] Michael Cogswell,et al. Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks , 2015, ArXiv.
[51] Tara N. Sainath,et al. Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling , 2019, ArXiv.
[52] Quoc V. Le,et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism , 2018, ArXiv.
[53] Nikhil R. Devanur,et al. PipeDream: Fast and Efficient Pipeline Parallel DNN Training , 2018, ArXiv.
[54] Thomas Hérault,et al. PaRSEC: Exploiting Heterogeneity to Enhance Scalability , 2013, Computing in Science & Engineering.
[55] Allen D. Malony,et al. An Autonomic Performance Environment for Exascale , 2015, Supercomput. Front. Innov..
[56] Hartmut Kaiser,et al. HPX: A Task Based Programming Model in a Global Address Space , 2014, PGAS.
[57] Alexander Sergeev,et al. Horovod: fast and easy distributed deep learning in TensorFlow , 2018, ArXiv.
[58] Andrew Lumsdaine,et al. A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++ , 2019, ArXiv.
[59] Alex Krizhevsky,et al. One weird trick for parallelizing convolutional neural networks , 2014, ArXiv.
[60] Dhabaleswar K. Panda,et al. HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow , 2020, ISC.
[61] Fan Zhou,et al. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization , 2017, IJCAI.
[62] Kenta Oono,et al. Chainer: a Next-Generation Open Source Framework for Deep Learning, 2015.
[63] Yang Wang,et al. BigDL: A Distributed Deep Learning Framework for Big Data , 2018, SoCC.
[64] Seung-Jong Park,et al. Evaluation of Deep Learning Frameworks Over Different HPC Architectures , 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).
[65] Hossein Bobarshad,et al. Stannis: Low-Power Acceleration of DNN Training Using Computational Storage Devices , 2020, 2020 57th ACM/IEEE Design Automation Conference (DAC).
[66] Sangeetha Abdu Jyothi,et al. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling , 2018, MLSys.
[67] Christos-Savvas Bouganis,et al. Caffe Barista: Brewing Caffe with FPGAs in the Training Loop , 2020, 2020 30th International Conference on Field-Programmable Logic and Applications (FPL).
[68] Pascal Bouvry,et al. Performance Analysis of Distributed and Scalable Deep Learning , 2020, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID).
[69] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[70] Chuan Wu,et al. Preemptive All-reduce Scheduling for Expediting Distributed DNN Training , 2020, IEEE INFOCOM 2020 - IEEE Conference on Computer Communications.
[71] Hartmut Kaiser,et al. Methodology for Adaptive Active Message Coalescing in Task Based Runtime Systems , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[72] Thomas L. Sterling,et al. ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications , 2009, 2009 International Conference on Parallel Processing Workshops.
[73] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Ali R. Butt,et al. A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers , 2019, 2019 IEEE International Conference on Cluster Computing (CLUSTER).
[75] Laxmikant V. Kalé,et al. CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.
[76] Torsten Hoefler,et al. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, 2018.
[77] Gang Chen,et al. SINGA: Putting Deep Learning in the Hands of Multimedia Users , 2015, ACM Multimedia.
[78] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.
[79] Marc Snir,et al. Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[80] Amith R. Mamidala,et al. MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep Learning , 2018, ArXiv.
[81] Dhabaleswar K. Panda,et al. Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation , 2018, 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[82] Olatunji Ruwase,et al. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters , 2020, KDD.
[83] Patrick Diehl,et al. Closing the Performance Gap with Modern C++, 2016, HiPC 2016.
[84] Forrest N. Iandola,et al. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[85] Steven G. Parker,et al. Uintah: a massively parallel problem solving environment , 2000, Proceedings the Ninth International Symposium on High-Performance Distributed Computing.
[86] Hossein Bobarshad,et al. STANNIS: Low-Power Acceleration of Deep Neural Network Training Using Computational Storage , 2020, ArXiv.
[87] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.