Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. Their superior performance is largely attributed to the use of stacked “self-attention” layers, each of which consists of matrix multiplications as well as softmax operations. As a result, unlike in other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers. To address this, we propose Softermax, a hardware-friendly softmax design. Softermax consists of base replacement, low-precision softmax computations, and an online normalization calculation. We show that Softermax achieves 2.35x the energy efficiency at 0.90x the size of a comparable baseline, with negligible impact on network accuracy.
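For concreteness, the sketch below illustrates how two of the ideas named above can fit together numerically: base replacement (computing 2^x instead of e^x, which is hardware-friendly because the integer part of the exponent reduces to a shift) and online normalization (updating the running maximum and running denominator in a single pass over the scores). This is a minimal NumPy illustration under our own assumptions; the function name `softermax_reference` and all implementation details are ours, and the low-precision arithmetic of the actual design is omitted.

```python
import numpy as np


def softermax_reference(scores):
    """Illustrative base-2 softmax with a single-pass online normalizer.

    A numerical sketch of the ideas named in the abstract, not the paper's
    hardware implementation:
      * base replacement: 2**x is used in place of e**x;
      * online normalization: the running maximum and running denominator
        are updated together in one pass, so no separate max-reduction
        pass over the scores is required.
    The low-precision (quantized) arithmetic of the real design is omitted.
    """
    running_max = -np.inf
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # If the maximum grows, rescale the partial denominator accordingly.
        running_sum = running_sum * 2.0 ** (running_max - new_max) + 2.0 ** (x - new_max)
        running_max = new_max
    # Second pass: normalize each exponentiated, max-shifted score.
    return np.array([2.0 ** (x - running_max) / running_sum for x in scores])


if __name__ == "__main__":
    s = np.array([1.0, 2.0, 3.0, 4.0])
    print(softermax_reference(s))       # base-2 distribution
    print(np.exp(s) / np.exp(s).sum())  # standard base-e softmax, for comparison
```

Note that 2^x = e^(x·ln 2), so a base-2 softmax is mathematically a standard softmax applied to logits scaled by ln 2; the negligible-accuracy claim above refers to the complete Softermax design, not to this simplified sketch.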
