Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. Their superior performance is largely attributed to the use of stacked “self-attention” layers, each of which consists of matrix multiplications as well as softmax operations. As a result, unlike in other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers. To address this, we propose Softermax, a hardware-friendly softmax design. Softermax consists of base replacement, low-precision softmax computations, and an online normalization calculation. We show that Softermax achieves 2.35x the energy efficiency at 0.90x the size of a comparable baseline, with negligible impact on network accuracy.
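For concreteness, the sketch below illustrates how two of the ideas named above can fit together numerically: base replacement (computing 2^x instead of e^x, which is hardware-friendly because the integer part of the exponent reduces to a shift) and online normalization (updating the running maximum and running denominator in a single pass over the scores). This is a minimal NumPy illustration under our own assumptions; the function name `softermax_reference` and all implementation details are ours, and the low-precision arithmetic of the actual design is omitted.

```python
import numpy as np


def softermax_reference(scores):
    """Illustrative base-2 softmax with a single-pass online normalizer.

    A numerical sketch of the ideas named in the abstract, not the paper's
    hardware implementation:
      * base replacement: 2**x is used in place of e**x;
      * online normalization: the running maximum and running denominator
        are updated together in one pass, so no separate max-reduction
        pass over the scores is required.
    The low-precision (quantized) arithmetic of the real design is omitted.
    """
    running_max = -np.inf
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # If the maximum grows, rescale the partial denominator accordingly.
        running_sum = running_sum * 2.0 ** (running_max - new_max) + 2.0 ** (x - new_max)
        running_max = new_max
    # Second pass: normalize each exponentiated, max-shifted score.
    return np.array([2.0 ** (x - running_max) / running_sum for x in scores])


if __name__ == "__main__":
    s = np.array([1.0, 2.0, 3.0, 4.0])
    print(softermax_reference(s))       # base-2 distribution
    print(np.exp(s) / np.exp(s).sum())  # standard base-e softmax, for comparison
```

Note that 2^x = e^(x·ln 2), so a base-2 softmax is mathematically a standard softmax applied to logits scaled by ln 2; the negligible-accuracy claim above refers to the complete Softermax design, not to this simplified sketch.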
