Optimizing Inference Performance of Transformers on CPUs

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question answering. While enormous research attention has been paid to the training of those models, relatively little effort has been made to improve their inference performance. This paper addresses this gap by presenting an empirical analysis of the scalability and performance of inference with a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify the key components of the Transformer architecture where the bulk of the computation happens, and propose an Adaptive Linear Module Optimization (ALMO) to speed them up. The optimization is evaluated using the inference benchmark from HuggingFace and is shown to achieve a speedup of up to 1.71x. Notably, ALMO requires no changes to the implementation of the models and does not affect their accuracy.
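As a rough illustration of where the bulk of the computation in an encoder layer can come from, the sketch below counts multiply-add FLOPs for the matrix multiplications of a single layer. It is not the paper's analysis or the ALMO method. The assumption that the "key components" are the linear (fully connected) modules is inferred only from the name of the optimization. The function names `layer_flops` and `linear_share` are hypothetical, the default sizes are the commonly cited BERT-base dimensions, and softmax, layer normalization, and activation costs are ignored.

```python
def layer_flops(seq_len, hidden=768, intermediate=3072):
    """Approximate matmul FLOPs for one Transformer encoder layer.

    Assumed sizes: BERT-base style (hidden=768, intermediate=3072).
    Each multiply-add is counted as 2 FLOPs; non-matmul ops are ignored.
    """
    # Four hidden x hidden projections: query, key, value, attention output.
    projections = 4 * 2 * seq_len * hidden * hidden
    # Two feed-forward linear layers: hidden -> intermediate -> hidden.
    feed_forward = 2 * 2 * seq_len * hidden * intermediate
    linear = projections + feed_forward
    # Q @ K^T and scores @ V, summed over all heads (heads * head_dim = hidden).
    attention = 2 * 2 * seq_len * seq_len * hidden
    return linear, attention


def linear_share(seq_len, hidden=768, intermediate=3072):
    """Fraction of the counted FLOPs that falls in the linear modules."""
    linear, attention = layer_flops(seq_len, hidden, intermediate)
    return linear / (linear + attention)


if __name__ == "__main__":
    for seq_len in (8, 128, 512):
        print(f"seq_len={seq_len}: linear share = {linear_share(seq_len):.3f}")
```

Under these assumptions the linear modules account for most of the counted FLOPs at short and moderate sequence lengths, and their share shrinks as the sequence grows because the attention matmuls scale quadratically with length. FLOP counts are only a proxy; measured CPU time also depends on threading, memory layout, and the underlying math library.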
