Michael J. Anderson | Alexander Heinecke | Greg Henry | Sasikanth Avancha | Hans Pabst | Dhiraj D. Kalamkar | Kunal Banerjee | Evangelos Georganas | Anand Venkat
[1] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[2] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.
[3] Marvin Minsky, et al. Perceptrons: An Introduction to Computational Geometry, 1969.
[4] Patrice Y. Simard, et al. High Performance Convolutional Neural Networks for Document Processing, 2006.
[5] Pradeep Dubey, et al. On Scale-out Deep Learning Training for Cloud and HPC, 2018, ArXiv.
[6] David Gregg, et al. Parallel Multi Channel Convolution using General Matrix Multiplication, 2017, 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP).
[7] Trevor Darrell, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.
[8] Alexander Heinecke, et al. Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[9] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[10] Geoffrey E. Hinton, et al. Speech Recognition with Deep Recurrent Neural Networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[11] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Albert Cohen, et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions, 2018, ArXiv.
[13] Jeffrey S. Vetter, et al. NVIDIA Tensor Core Programmability, Performance & Precision, 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[14] Jack Dongarra, et al. MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs, 2019.
[15] U. N. Niranjan, et al. Tensor Contractions with Extended BLAS Kernels on CPU and GPU, 2016, HiPC.
[16] Rengan Xu, et al. Deep Learning at Scale on NVIDIA V100 Accelerators, 2018, 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).
[17] Tim Zerrell, et al. Stripe: Tensor Compilation via the Nested Polyhedral Model, 2019, ArXiv.
[18] Yann Le Cun, et al. A Theoretical Framework for Back-Propagation, 1988.
[19] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.
[20] Jack J. Dongarra, et al. The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems, 2017, ICCS.
[21] Heng-Tze Cheng, et al. Wide & Deep Learning for Recommender Systems, 2016, DLRS@RecSys.
[22] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[23] Alexander Sergeev, et al. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow, 2018, ArXiv.
[24] James Demmel, et al. Large-Batch Training for LSTM and Beyond, 2019, SC.
[25] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.
[26] Minjia Zhang, et al. DeepCPU: Serving RNN-based Deep Learning Models 10x Faster, 2018, USENIX Annual Technical Conference.
[27] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.
[28] Vincent Vanhoucke, et al. Improving the Speed of Neural Networks on CPUs, 2011.
[29] Yida Wang, et al. Optimizing CNN Model Inference on CPUs, 2018, USENIX Annual Technical Conference.
[30] Jinyu Li, et al. Feature Learning in Deep Neural Networks: Studies on Speech Recognition Tasks, 2013, ICLR.
[31] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[32] David A. Patterson, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[33] Alexander Heinecke, et al. LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[34] Michael Kruse, et al. High-Performance Generalized Tensor Operations, 2018, ACM Trans. Archit. Code Optim.
[35] Dumitru Erhan, et al. Going Deeper with Convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Robert A. van de Geijn, et al. Anatomy of High-Performance Matrix Multiplication, 2008, TOMS.
[37] Kunle Olukotun, et al. High-Accuracy Low-Precision Training, 2018, ArXiv.
[38] Bertrand A. Maher, et al. Glow: Graph Lowering Compiler Techniques for Neural Networks, 2018, ArXiv.
[39] David Gregg, et al. Low-Memory GEMM-based Convolution Algorithms for Deep Neural Networks, 2017, ArXiv.
[40] Kurt Hornik, et al. Multilayer Feedforward Networks are Universal Approximators, 1989, Neural Networks.