CheckFreq: Frequent, Fine-Grained DNN Checkpointing
暂无分享,去创建一个
Amar Phanishayee | Vijay Chidambaram | Jayashree Mohan | Amar Phanishayee | Vijay Chidambaram | Jayashree Mohan
[1] Wencong Xiao,et al. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads , 2019, USENIX Annual Technical Conference.
[2] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.
[3] Tie-Yan Liu,et al. Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling , 2017, Neurocomputing.
[4] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Franck Cappello,et al. DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models , 2020, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID).
[6] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Amar Phanishayee,et al. Analysis and Exploitation of Dynamic Pricing in the Public Cloud for ML Training , 2020 .
[8] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[9] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.
[10] Saurabh Gupta,et al. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[11] Kang G. Shin,et al. Tiresias: A GPU Cluster Manager for Distributed Deep Learning , 2019, NSDI.
[12] Apostol Natsev,et al. YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.
[13] Zhiling Lan,et al. Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.
[14] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[15] Kilian Q. Weinberger,et al. Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Geoffrey E. Hinton,et al. Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[17] Amar Phanishayee,et al. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads , 2020, OSDI.
[18] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[19] Amar Phanishayee,et al. Themis: Fair and Efficient GPU Cluster Scheduling , 2020, NSDI.
[20] Zheng Zhang,et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.
[21] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[22] Wencong Xiao,et al. Gandiva: Introspective Cluster Scheduling for Deep Learning , 2018, OSDI.
[23] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[24] Sanja Fidler,et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[25] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[27] Franck Cappello,et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[28] Franck Cappello,et al. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[29] Andrea C. Arpaci-Dusseau,et al. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications , 2014, OSDI.
[30] Christian Engelmann,et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] Gerhard Wellein,et al. An Evaluation of Different I/O Techniques for Checkpoint/Restart , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.