Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU

The past several years have witnessed the tremendous success of transformer models in natural language processing (NLP), and their landscape continues to diversify. Although GPUs have gradually become the dominant workhorse and de facto standard for deep learning, there are still many scenarios in which CPUs remain the prevalent choice. Recently, ARM many-core processors have begun migrating into cloud computing and high-performance computing, making them a promising platform for deploying transformer inference. In this paper, we identify several performance bottlenecks of existing inference runtimes on many-core CPUs: low core utilization, isolated thread configuration, suboptimal implementations of general matrix multiplication (GEMM), and redundant computation for variable-length inputs. To tackle these problems, we conduct full-stack optimizations spanning the service level down to the operator level. At the service level, we explore multi-instance parallelization to improve CPU core utilization. To improve the parallel efficiency of the inference runtime, we design NUMA-aware thread scheduling and a look-up table of optimal parallel configurations. The GEMM implementation is tailored to critical modules to exploit the characteristics of the transformer workload. To eliminate redundant computation, we design and implement a novel storage format that packs sparse data, together with a load-balancing strategy for tasks with different sparsity. Experiments show that our implementation outperforms existing solutions by 1.1x to 6x for fixed-length inputs, and achieves 1.9x to 8x speedups for variable-length inputs on different ARM many-core processors.
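To make the variable-length redundancy concrete: batching typically pads every sequence to the batch maximum, so downstream GEMMs spend cycles on pad tokens. The abstract does not define the paper's actual storage format, so the sketch below is only a minimal illustration of one common CSR-style packing; the names `PackedBatch` and `pack` are hypothetical, not from the paper.

```cpp
// Illustrative sketch only: a CSR-style packed layout for a batch of
// variable-length sequences. Valid tokens are stored back to back with a
// per-sequence offset array, so no pad positions enter later GEMM calls.
#include <cstdio>
#include <vector>

struct PackedBatch {
    std::vector<float> tokens;   // valid token embeddings, contiguous
    std::vector<int>   offsets;  // offsets[i]..offsets[i+1] spans sequence i
    int hidden;                  // embedding width per token
};

// Pack padded input [batch][max_len * hidden] into the dense layout,
// dropping the pad positions entirely.
PackedBatch pack(const std::vector<std::vector<float>>& padded,
                 const std::vector<int>& lengths, int hidden) {
    PackedBatch p;
    p.hidden = hidden;
    p.offsets.push_back(0);
    for (size_t i = 0; i < padded.size(); ++i) {
        p.tokens.insert(p.tokens.end(), padded[i].begin(),
                        padded[i].begin() + lengths[i] * hidden);
        p.offsets.push_back(p.offsets.back() + lengths[i]);
    }
    return p;
}

int main() {
    const int hidden = 4;
    std::vector<int> lengths = {2, 3, 1};        // three sequences
    std::vector<std::vector<float>> padded(
        3, std::vector<float>(3 * hidden, 1.0f)); // padded to max_len = 3
    PackedBatch p = pack(padded, lengths, hidden);
    // 6 valid tokens survive instead of 9 padded slots.
    std::printf("packed tokens: %d\n", p.offsets.back());
}
```

With such a layout, token-wise operators run over `offsets.back()` rows rather than batch x max_len, which is the kind of redundant work the paper's format and load-balancing strategy are designed to eliminate.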
