GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference
[1] Omer Levy, et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019, TACL.
[2] Martin Andrews, et al. Transformer to CNN: Label-scarce distillation for efficient text classification, 2019, ArXiv.
[3] Bruce Jacob, et al. DRAMSim2: A Cycle Accurate Memory System Simulator, 2011, IEEE Computer Architecture Letters.
[4] Yu Cheng, et al. Patient Knowledge Distillation for BERT Model Compression, 2019, EMNLP.
[5] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[6] Jimmy J. Lin, et al. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, 2019, ArXiv.
[7] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[8] Kurt Keutzer, et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT, 2020, AAAI.
[9] Yin Yang, et al. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT, 2020, ArXiv.
[10] Anna Rumshisky, et al. Revealing the Dark Secrets of BERT, 2019, EMNLP.
[11] Moshe Wasserblat, et al. Q8BERT: Quantized 8Bit BERT, 2019, 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS).
[12] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[13] Rémi Louf, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.
[14] Jimmy J. Lin, et al. Natural Language Generation for Effective Knowledge Distillation, 2019, EMNLP.
[15] Shaohuai Shi, et al. Understanding Top-k Sparsification in Distributed Deep Learning, 2019, ArXiv.
[16] Song Han, et al. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, 2020, ACL.
[17] Eunhyeok Park, et al. Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation, 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[18] Song Han, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[19] Edouard Grave, et al. Reducing Transformer Depth on Demand with Structured Dropout, 2019, ICLR.
[20] Yiming Yang, et al. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, ACL.
[21] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.
[22] Kevin Duh, et al. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning, 2020, RepL4NLP@ACL.
[23] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.
[24] Tao Zhang, et al. A Survey of Model Compression and Acceleration for Deep Neural Networks, 2017, ArXiv.
[25] Norman P. Jouppi, et al. CACTI 6.0: A Tool to Model Large Caches, 2009.
[26] Yves Scherrer, et al. Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation, 2020, EMNLP.
[27] Joel Emer, et al. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[28] Yang Song, et al. Extreme Language Model Compression with Optimal Subwords and Shared Projections, 2019, ArXiv.
[29] Qun Liu, et al. TinyBERT: Distilling BERT for Natural Language Understanding, 2020, EMNLP.
[30] David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[31] Tor M. Aamodt, et al. Modeling Deep Learning Accelerator Enabled GPUs, 2018, 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[32] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[33] Marco Maggioni, et al. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, 2018, ArXiv.
[34] Wayne Luk, et al. Deep Neural Network Approximation for Custom Hardware, 2019, ACM Comput. Surv.
[35] David Thorsley, et al. Post-training Piecewise Linear Quantization for Deep Neural Networks, 2020, ECCV.
[36] Eunhyeok Park, et al. Value-aware Quantization for Training and Inference of Neural Networks, 2018, ECCV.
[37] Joel Silberman, et al. A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference, 2018, 2018 IEEE Symposium on VLSI Circuits.