CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models

Large Language Models (LLMs) have dramatically advanced AI applications, yet their deployment remains challenging due to their immense inference costs. Recent studies reduce the computational cost of LLMs by increasing their activation sparsity, but they suffer significant performance degradation on downstream tasks. In this work, we introduce a new framework for sparsifying the activations of base LLMs and reducing inference costs, dubbed Contextually-Aware Thresholding for Sparsity (CATS). CATS is relatively simple, easy to implement, and highly effective. At the heart of our framework is a new non-linear activation function. We demonstrate that CATS can be applied to various base models, including Mistral-7B and Llama2-7B, and outperforms existing sparsification techniques in downstream task performance. More precisely, CATS-based models often achieve downstream task performance within 1-2% of their base models without any fine-tuning, even at activation sparsity levels of 50%. Furthermore, CATS-based models converge faster and display better task performance than competing techniques when fine-tuning is applied. Finally, we develop a custom GPU kernel for the efficient implementation of CATS that translates its activation sparsity into real wall-clock speedups. Our custom kernel implementation of CATS yields a ~15% improvement in the wall-clock inference latency of token generation on both Llama2-7B and Mistral-7B.
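To make the thresholding idea concrete, the following minimal PyTorch sketch illustrates the kind of non-linear activation the abstract describes: entries of the MLP gate activation whose magnitude falls below a cutoff are zeroed, so that a target fraction (e.g., 50%) of activations become sparse. The function name `cats_threshold`, the per-call quantile computation, and the default `sparsity=0.5` are illustrative assumptions, not the paper's exact implementation, which calibrates thresholds offline and pairs them with a custom GPU kernel to realize wall-clock gains.

```python
import torch

def cats_threshold(gate_activations: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Sketch of a contextually-aware thresholded activation.

    `gate_activations` stands in for the output of the MLP gating
    non-linearity (e.g., SiLU(x @ W_gate) in Llama2/Mistral blocks).
    Entries whose magnitude falls below a cutoff are zeroed so that
    roughly `sparsity` of the entries are inactive.
    """
    # Cutoff taken as the `sparsity`-quantile of |activation| values.
    # The paper calibrates such thresholds offline on sample data rather
    # than per call; computing it on the fly here is a simplification.
    cutoff = torch.quantile(gate_activations.abs().float(), sparsity)
    mask = gate_activations.abs() >= cutoff
    return gate_activations * mask
```

Downstream, the sparse activation pattern is what the custom kernel exploits: rows of the subsequent projection that correspond to zeroed entries need not be loaded or multiplied, which is where the reported latency reduction comes from.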
