Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism

There has been rapid progress in custom hardware (HW) for accelerating the inference of deep neural networks (DNNs). Previously, the softmax layer was not a main concern for DNN-accelerating HW, because it accounts for only a small portion of the computation in multi-layer perceptrons and convolutional neural networks. However, as attention mechanisms are now widely used in modern DNNs, a cost-efficient implementation of the softmax layer has become important. In this paper, we propose two methods for approximating the softmax computation, both based on lookup tables (LUTs). The required LUT size is small (about 700 bytes) because the ranges of the softmax numerator and denominator remain stable when the input is normalized. We validate the proposed techniques on several AI tasks (object detection, machine translation, sentiment analysis, and semantic equivalence) and DNN models (DETR, Transformer, BERT) using a variety of benchmarks (COCO17, WMT14, WMT17, GLUE). We show that the 8-bit approximation keeps the accuracy loss below 1.0%.
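
To illustrate the general idea (a minimal sketch, not the paper's exact hardware design), the following Python snippet approximates softmax with a small precomputed exp() table over max-normalized inputs; the table size, clipping range, and quantization step are illustrative assumptions, and the 8-bit/fixed-point storage implied by the 700-byte figure is replaced here by floating-point entries for simplicity.

import numpy as np

# Illustrative LUT-based softmax sketch (assumed parameters, not the paper's
# exact design). After subtracting the row maximum, every exponent argument
# lies in a bounded negative range, so exp() can be served from a small table.
LUT_BITS = 8                                  # assumed index width -> 256 entries
X_MIN = -16.0                                 # assumed clipping point; exp(-16) ~ 1e-7
STEP = -X_MIN / (1 << LUT_BITS)               # quantization step for the table index
EXP_LUT = np.exp(-np.arange(1 << LUT_BITS) * STEP)   # precomputed exp() values

def lut_softmax(x: np.ndarray) -> np.ndarray:
    """Approximate softmax along the last axis using the exp() LUT."""
    z = x - x.max(axis=-1, keepdims=True)     # input normalization: z <= 0
    z = np.clip(z, X_MIN, 0.0)
    idx = np.minimum((-z / STEP).astype(np.int32), (1 << LUT_BITS) - 1)
    num = EXP_LUT[idx]                        # table lookup replaces exp()
    return num / num.sum(axis=-1, keepdims=True)

# Sanity check against the exact softmax.
x = np.random.randn(4, 10).astype(np.float32)
ref = np.exp(x - x.max(-1, keepdims=True))
ref /= ref.sum(-1, keepdims=True)
print(np.max(np.abs(lut_softmax(x) - ref)))   # small approximation error

In an actual 8-bit HW implementation, the table entries and the division would presumably be carried out in low-precision fixed point, which is what brings the storage cost down to the roughly 700 bytes quoted above.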
