K-TanH: Hardware Efficient Activations For Deep Learning

We propose K-TanH, a novel, highly accurate, hardware-efficient approximation of the popular activation function Tanh for Deep Learning. K-TanH consists of a sequence of parameterized bit/integer operations, such as masking, shift, and add/subtract (no floating-point operations are needed), where the parameters are stored in a very small look-up table (the bit-masking step can be eliminated). The design of K-TanH is flexible enough to handle multiple numerical formats, such as FP32 and BFloat16. High-quality approximations to other activation functions, e.g., Swish and GELU, can be derived from K-TanH. We provide an RTL design for K-TanH to demonstrate its area/power/performance efficacy. It is more accurate than existing piecewise approximations of Tanh; for example, K-TanH achieves $\sim 5\times$ speed-up and $> 6\times$ reduction in maximum approximation error over a software implementation of Hard TanH. Experimental results for low-precision BFloat16 training of the language translation model GNMT on WMT16 data sets, with approximate Tanh and Sigmoid obtained via K-TanH, achieve accuracy and convergence similar to training with exact Tanh and Sigmoid.
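To make the table-driven shift-and-add idea concrete, the sketch below shows one way such an approximation can be evaluated on a BFloat16 value in C: the exponent bits select an entry from a tiny parameter table, and the result is assembled with one shift and one add on the mantissa. This is a minimal sketch under assumed interval boundaries, table size, and indexing; the table entries shown are placeholders from a hypothetical offline fit, not the published K-TanH parameters.

```c
#include <stdint.h>

/* Hypothetical parameter entry: output exponent field, mantissa right-shift,
 * and additive bias. Real entries would come from an offline parameter fit. */
typedef struct {
    uint8_t exp_out;  /* exponent bits of the result               */
    uint8_t shift;    /* right-shift applied to the input mantissa */
    uint8_t bias;     /* value added to the shifted mantissa       */
} ktanh_param_t;

/* Placeholder table covering the assumed "interesting" exponent range. */
static const ktanh_param_t ktanh_lut[10] = { {0, 0, 0} /* ... fit offline ... */ };

/* Sketch of a K-TanH-style evaluation on a BFloat16 bit pattern
 * (1 sign bit, 8 exponent bits, 7 mantissa bits).
 * Only integer masking, shifting and adding are used; no floating-point ops. */
static uint16_t ktanh_bf16(uint16_t x)
{
    uint16_t sign = x & 0x8000u;       /* tanh is odd, so the sign passes through */
    uint16_t exp  = (x >> 7) & 0xFFu;  /* exponent field                          */
    uint16_t mant = x & 0x7Fu;         /* mantissa field                          */

    /* Saturation regions (boundaries are assumptions for this sketch):
     * very small |x| returns x itself, very large |x| returns +/-1.0 (0x3F80). */
    if (exp < 120u) return x;
    if (exp > 129u) return (uint16_t)(sign | 0x3F80u);

    /* Index the small LUT by the exponent, then form the result with
     * a single shift and a single add on the mantissa. */
    const ktanh_param_t *p = &ktanh_lut[exp - 120u];
    uint16_t mant_out = (uint16_t)((mant >> p->shift) + p->bias) & 0x7Fu;
    return (uint16_t)(sign | ((uint16_t)p->exp_out << 7) | mant_out);
}
```

Once such a Tanh kernel is available, Sigmoid follows from the identity $\sigma(x) = \tfrac{1}{2}\left(1 + \tanh(x/2)\right)$, and the standard tanh-based approximations of GELU and Swish can be built on top of it in the same way.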
