Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU