Alex Kogan | Dave Dice
[1] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[2] Yang Yu, et al. TurboTransformers: an efficient GPU serving system for transformer models, 2020, PPoPP.
[3] Alexander M. Rush, et al. The Annotated Transformer, 2018.
[4] Carole-Jean Wu, et al. Exploiting Parallelism Opportunities with Deep Learning Frameworks, 2019, ACM Trans. Archit. Code Optim.
[5] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[6] Yu Cheng, et al. Patient Knowledge Distillation for BERT Model Compression, 2019, EMNLP.
[7] Furu Wei, et al. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, 2020, NeurIPS.
[8] Han Fang, et al. Linformer: Self-Attention with Linear Complexity, 2020, ArXiv.
[9] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.
[10] X. Chu, et al. Energy-efficient Inference Service of Transformer-based Deep Learning Models on GPUs, 2020, IEEE iThings/GreenCom/CPSCom/SmartData/Cybermatics.
[11] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, ACM TOMS.
[12] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[13] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[14] Carole-Jean Wu, et al. Machine Learning at Facebook: Understanding Inference at the Edge, 2019, IEEE International Symposium on High Performance Computer Architecture (HPCA).
[15] Niranjan Balasubramanian, et al. DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering, 2020, ACL.
[16] Li Yang, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.
[17] Di He, et al. Efficient Training of BERT by Progressively Stacking, 2019, ICML.
[18] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.
[19] Yida Wang, et al. Optimizing CNN Model Inference on CPUs, 2018, USENIX Annual Technical Conference.
[20] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[21] Benjamin Van Durme, et al. Which *BERT? A Survey Organizing Contextualized Encoders, 2020, EMNLP.
[22] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.