NVIDIA Tensor Core Programmability, Performance & Precision
Stefano Markidis | Steven Wei Der Chien | Ivy Bo Peng | Jeffrey S. Vetter | Erwin Laure