暂无分享,去创建一个
[1] Torsten Hoefler,et al. Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0 , 2017 .
[2] Nikhil R. Devanur,et al. PipeDream: generalized pipeline parallelism for DNN training , 2019, SOSP.
[3] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[4] Mary W. Hall,et al. SWIRL: High-performance many-core CPU code generation for deep neural networks , 2019, Int. J. High Perform. Comput. Appl..
[5] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.
[6] Takuya Akiba,et al. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes , 2017, ArXiv.
[7] Torsten Hoefler,et al. Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures , 2019, SC.
[8] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[9] James Demmel,et al. ImageNet Training in Minutes , 2017, ICPP.
[10] Patrice Y. Simard,et al. High Performance Convolutional Neural Networks for Document Processing , 2006 .
[11] Yiming Yang,et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.
[12] Samy Bengio,et al. Device Placement Optimization with Reinforcement Learning , 2017, ICML.
[13] Michel Steuwer,et al. LIFT: A functional data-parallel IR for high-performance GPU code generation , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[14] Phil Blunsom,et al. Recurrent Continuous Translation Models , 2013, EMNLP.
[15] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.
[16] Dong Yu,et al. Pipelined BackPropagation for Context-Dependent Deep Neural Networks , 2012 .
[17] Mohammad Shoeybi,et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism , 2019, ArXiv.
[18] Hiroaki Mikami,et al. Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash , 2018 .
[19] Lukasz Kaiser,et al. Reformer: The Efficient Transformer , 2020, ICLR.
[20] Lav R. Varshney,et al. CTRL: A Conditional Transformer Language Model for Controllable Generation , 2019, ArXiv.
[21] Marc'Aurelio Ranzato,et al. Large Scale Distributed Deep Networks , 2012, NIPS.
[22] Kurt Keutzer,et al. Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization , 2019, MLSys.
[23] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[24] Omer Levy,et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , 2019, NeurIPS.
[25] Lane Schwartz,et al. DLVM: A modern compiler infrastructure for deep learning systems , 2017, ICLR.
[26] André F. T. Martins,et al. Adaptively Sparse Transformers , 2019, EMNLP.
[27] Andrew Lavin,et al. Fast Algorithms for Convolutional Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Lidong Zhou,et al. Astra: Exploiting Predictability to Optimize Deep Learning , 2019, ASPLOS.
[29] Jacek Tabor,et al. Molecule Attention Transformer , 2020, ArXiv.
[30] Myle Ott,et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.
[31] Dustin Tran,et al. Mesh-TensorFlow: Deep Learning for Supercomputers , 2018, NeurIPS.
[32] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.
[33] Tao Wang,et al. Image Classification at Supercomputer Scale , 2018, ArXiv.
[34] Tianqi Chen,et al. Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.
[35] Benoît Meister,et al. Polyhedral Optimization of TensorFlow Computation Graphs , 2017, ESPT/VPA@SC.
[36] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[37] Nam Sung Kim,et al. Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training , 2018, NeurIPS.
[38] Alexander Aiken,et al. Beyond Data and Model Parallelism for Deep Neural Networks , 2018, SysML.
[39] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.
[40] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[41] Sergio Gomez Colmenarejo,et al. TF-Replicator: Distributed Machine Learning for Researchers , 2019, ArXiv.
[42] Ashish Vaswani,et al. Stand-Alone Self-Attention in Vision Models , 2019, NeurIPS.
[43] Hao Wu,et al. Mixed Precision Training , 2017, ICLR.
[44] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[45] Kevin Gimpel,et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.
[46] Dustin Tran,et al. Image Transformer , 2018, ICML.
[47] Jonathan S. Rosenfeld,et al. A Constructive Prediction of the Generalization Error Across Scales , 2020, ICLR.
[48] Quoc V. Le,et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism , 2018, ArXiv.
[49] Michael Carbin,et al. TIRAMISU: A Polyhedral Compiler for Dense and Sparse Deep Learning , 2020, ArXiv.
[50] Marc Snir,et al. Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[51] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[52] Hariharan Sandanagobalane,et al. Diesel: DSL for linear algebra and neural net computations on GPUs , 2018, MAPL@PLDI.
[53] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.
[54] Geoffrey E. Hinton,et al. Dynamic Routing Between Capsules , 2017, NIPS.
[55] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[56] Guigang Zhang,et al. Deep Learning , 2016, Int. J. Semantic Comput..
[57] Yi Yang,et al. Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[58] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.
[59] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..
[60] Matei Zaharia,et al. Optimizing DNN Computation with Relaxed Graph Substitutions , 2019, MLSys.
[61] Siu Cheung Hui,et al. Simple and Effective Curriculum Pointer-Generator Networks for Reading Comprehension over Long Narratives , 2019, ACL.
[62] Marc Snir,et al. Channel and filter parallelism for large-scale CNN training , 2019, SC.
[63] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[64] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[65] Matthew Johnson,et al. Compiling machine learning programs via high-level tracing , 2018 .
[66] Noam Shazeer,et al. Fast Transformer Decoding: One Write-Head is All You Need , 2019, ArXiv.
[67] R'emi Louf,et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.
[68] Lei Liu,et al. Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity , 2019, 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[69] Edouard Grave,et al. Adaptive Attention Span in Transformers , 2019, ACL.
[70] Raquel Urtasun,et al. The Reversible Residual Network: Backpropagation Without Storing Activations , 2017, NIPS.
[71] Yiming Yang,et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.
[72] Alec Radford,et al. Improving Language Understanding by Generative Pre-Training , 2018 .
[73] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[74] Uday Bondhugula,et al. MLIR: A Compiler Infrastructure for the End of Moore's Law , 2020, ArXiv.
[75] Trishul M. Chilimbi,et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System , 2014, OSDI.
[76] Albert Cohen,et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions , 2018, ArXiv.
[77] Yann LeCun,et al. Fast Training of Convolutional Networks through FFTs , 2013, ICLR.
[78] Martin Jaggi,et al. On the Relationship between Self-Attention and Convolutional Layers , 2019, ICLR.
[79] Christopher D. Manning,et al. Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.
[80] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[81] Daniele Paolo Scarpazza,et al. Dissecting the Graphcore IPU Architecture via Microbenchmarking , 2019, ArXiv.
[82] Kjell Schubert,et al. Transformer-Transducer: End-to-End Speech Recognition with Self-Attention , 2019, ArXiv.
[83] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018 .
[84] Uday Bondhugula,et al. Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model , 2008, CC.
[85] Hai Liu,et al. Latte: a language, compiler, and runtime for elegant and efficient deep neural networks , 2016, PLDI.
[86] Percy Liang,et al. Know What You Don’t Know: Unanswerable Questions for SQuAD , 2018, ACL.
[87] Guillaume Lample,et al. Deep Learning for Symbolic Mathematics , 2019, ICLR.
[88] Liu Yang,et al. Sparse Sinkhorn Attention , 2020, ICML.
[89] Alec Radford,et al. Scaling Laws for Neural Language Models , 2020, ArXiv.
[90] James Demmel,et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes , 2019, ICLR.
[91] Alexander Aiken,et al. TASO: optimizing deep learning computation with automatic generation of graph substitutions , 2019, SOSP.
[92] Pierre Vandergheynst,et al. Geometric Deep Learning: Going beyond Euclidean data , 2016, IEEE Signal Process. Mag..
[93] Hyojin Kim,et al. LBANN: livermore big artificial neural network HPC toolkit , 2015, MLHPC@SC.
[94] Torsten Hoefler,et al. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. , 2018 .
[95] D. Scott Cyphers,et al. Intel® nGraphTM , 2018 .
[96] Olatunji Ruwase,et al. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models , 2019, SC.
[97] Geoffrey Zweig,et al. Linguistic Regularities in Continuous Space Word Representations , 2013, NAACL.
[98] Ilya Sutskever,et al. Generating Long Sequences with Sparse Transformers , 2019, ArXiv.
[99] Torsten Hoefler,et al. Accelerating Deep Learning Frameworks with Micro-Batches , 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[100] Thomas Wolf,et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , 2019, ArXiv.
[101] Guillermo Sapiro,et al. Deep learning? , 1999 .
[102] Christian Lengauer,et al. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation , 2012, Parallel Process. Lett..
[103] John Shalf,et al. Trends in Data Locality Abstractions for HPC Systems , 2017, IEEE Transactions on Parallel and Distributed Systems.
[104] Kurt Keutzer,et al. Integrated Model, Batch, and Domain Parallelism in Training Neural Networks , 2017, SPAA.
[105] Fedor Moiseev,et al. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned , 2019, ACL.
[106] Geoffrey E. Hinton,et al. Layer Normalization , 2016, ArXiv.
[107] John Tran,et al. cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.
[108] Quoc V. Le,et al. Attention Augmented Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[109] Dan Klein,et al. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers , 2020, ArXiv.
[110] Geoffrey E. Hinton,et al. Learning representations by back-propagating errors , 1986, Nature.
[111] Razvan Pascanu,et al. Stabilizing Transformers for Reinforcement Learning , 2019, ICML.
[112] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[113] Chengqi Zhang,et al. Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling , 2018, ICLR.
[114] Andrew McCallum,et al. Energy and Policy Considerations for Deep Learning in NLP , 2019, ACL.
[115] Tao Wang,et al. Deep learning with COTS HPC systems , 2013, ICML.
[116] Alexander Aiken,et al. Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[117] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[118] Torsten Hoefler,et al. A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations , 2019, SC.
[119] Bertrand A. Maher,et al. Glow: Graph Lowering Compiler Techniques for Neural Networks , 2018, ArXiv.
[120] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.