On Optimal Caching and Model Multiplexing for Large Model Inference

Large Language Models (LLMs) and other large foundation models have achieved notable success, but their size exacerbates existing resource-consumption and latency challenges. In particular, large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing. Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we achieve optimal rates in both the offline and online settings. Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to a $50\times$ improvement when the ratio between the maximum and minimum cost is $100$. Experiments on real datasets show a $4.3\times$ improvement in FLOPs over the baseline when the FLOPs ratio is $10$, and a $1.8\times$ improvement in latency when the average-latency ratio is $1.85$.
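As a rough illustration of the joint scheme described in the abstract, the Python sketch below pairs an LEC-style cache, whose eviction priority is estimated frequency times estimated recomputation cost, with a simple cost-aware multiplexer that routes easy queries to a cheap model and hard ones to an expensive model. The names (LECCache, multiplex, serve, quality_estimator) and the exact eviction rule are illustrative assumptions for exposition, not the paper's implementation of GDSF/LEC or its learned multiplexer.

    # Hypothetical sketch of joint caching + model multiplexing; names and
    # eviction rule are illustrative, not the paper's actual algorithm.
    from collections import defaultdict

    class LECCache:
        """Least-Expected-Cost-style cache: evict the entry whose estimated
        frequency x estimated recomputation cost is smallest, so cheap or
        rarely repeated queries are dropped first."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.store = {}              # query -> cached response
            self.freq = defaultdict(int) # empirical frequency estimates
            self.cost = {}               # estimated cost to recompute a query

        def get(self, query):
            if query in self.store:
                self.freq[query] += 1    # cache hits also count toward frequency
                return self.store[query]
            return None

        def put(self, query, response, est_cost):
            self.freq[query] += 1
            self.cost[query] = est_cost
            self.store[query] = response
            if len(self.store) > self.capacity:
                # Evict the cached query with the lowest expected saved cost.
                victim = min(self.store, key=lambda q: self.freq[q] * self.cost[q])
                del self.store[victim]

    def multiplex(query, small_model, large_model, quality_estimator, threshold=0.5):
        """Route to the small (cheap) model when the estimated probability that
        it suffices exceeds the threshold; otherwise use the large model."""
        if quality_estimator(query) >= threshold:
            return small_model(query)
        return large_model(query)

    def serve(query, cache, small_model, large_model, quality_estimator, est_cost):
        """Answer a query from the cache if possible, otherwise via the multiplexer."""
        hit = cache.get(query)
        if hit is not None:
            return hit
        response = multiplex(query, small_model, large_model, quality_estimator)
        cache.put(query, response, est_cost(query))
        return response

In this sketch the quality estimator and cost estimator stand in for the learned multiplexer and cost model studied in the paper; any callables with those signatures (for example, a small classifier for quality and a token-count heuristic for cost) can be plugged in.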
