CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models