K-TanH: Hardware Efficient Activations For Deep Learning

We propose K-TanH, a novel, highly accurate, hardware-efficient approximation of the popular activation function Tanh for Deep Learning. K-TanH consists of a sequence of parameterized bit/integer operations, such as masking, shift, and add/subtract (no floating-point operations are needed), where the parameters are stored in a very small look-up table (the bit-masking step can be eliminated). The design of K-TanH is flexible enough to handle multiple numerical formats, such as FP32 and BFloat16. High-quality approximations to other activation functions, e.g., Swish and GELU, can be derived from K-TanH. We provide an RTL design for K-TanH to demonstrate its area/power/performance efficacy. It is more accurate than existing piecewise approximations of Tanh; for example, K-TanH achieves $\sim 5\times$ speed-up and $> 6\times$ reduction in maximum approximation error over a software implementation of Hard TanH. Experimental results for low-precision BFloat16 training of the language translation model GNMT on WMT16 data sets, with approximate Tanh and Sigmoid obtained via K-TanH, achieve accuracy and convergence similar to training with exact Tanh and Sigmoid.
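To make the table-driven shift-and-add idea concrete, the sketch below shows one way such an approximation can be evaluated on a BFloat16 value in C: the exponent bits select an entry from a tiny parameter table, and the result is assembled with one shift and one add on the mantissa. This is a minimal sketch under assumed interval boundaries, table size, and indexing; the table entries shown are placeholders from a hypothetical offline fit, not the published K-TanH parameters.

```c
#include <stdint.h>

/* Hypothetical parameter entry: output exponent field, mantissa right-shift,
 * and additive bias. Real entries would come from an offline parameter fit. */
typedef struct {
    uint8_t exp_out;  /* exponent bits of the result               */
    uint8_t shift;    /* right-shift applied to the input mantissa */
    uint8_t bias;     /* value added to the shifted mantissa       */
} ktanh_param_t;

/* Placeholder table covering the assumed "interesting" exponent range. */
static const ktanh_param_t ktanh_lut[10] = { {0, 0, 0} /* ... fit offline ... */ };

/* Sketch of a K-TanH-style evaluation on a BFloat16 bit pattern
 * (1 sign bit, 8 exponent bits, 7 mantissa bits).
 * Only integer masking, shifting and adding are used; no floating-point ops. */
static uint16_t ktanh_bf16(uint16_t x)
{
    uint16_t sign = x & 0x8000u;       /* tanh is odd, so the sign passes through */
    uint16_t exp  = (x >> 7) & 0xFFu;  /* exponent field                          */
    uint16_t mant = x & 0x7Fu;         /* mantissa field                          */

    /* Saturation regions (boundaries are assumptions for this sketch):
     * very small |x| returns x itself, very large |x| returns +/-1.0 (0x3F80). */
    if (exp < 120u) return x;
    if (exp > 129u) return (uint16_t)(sign | 0x3F80u);

    /* Index the small LUT by the exponent, then form the result with
     * a single shift and a single add on the mantissa. */
    const ktanh_param_t *p = &ktanh_lut[exp - 120u];
    uint16_t mant_out = (uint16_t)((mant >> p->shift) + p->bias) & 0x7Fu;
    return (uint16_t)(sign | ((uint16_t)p->exp_out << 7) | mant_out);
}
```

Once such a Tanh kernel is available, Sigmoid follows from the identity $\sigma(x) = \tfrac{1}{2}\left(1 + \tanh(x/2)\right)$, and the standard tanh-based approximations of GELU and Swish can be built on top of it in the same way.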
