FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost of querying popular LLM APIs, e.g., GPT-4, ChatGPT, and J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost of using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of the LLM cascade that learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g., GPT-4) with up to a 98% cost reduction, or improve accuracy over GPT-4 by 4% at the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
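
To make the LLM cascade idea concrete, the sketch below shows one minimal way such a cascade could be wired up. It is an illustrative assumption, not the paper's implementation: the `query_fn` wrappers, the `score_fn` answer scorer, the per-query prices, and the acceptance thresholds are all hypothetical stand-ins, whereas in FrugalGPT the scorer is learned and the model ordering and thresholds are chosen to maximize accuracy under a cost budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CascadeStage:
    """One stage of a cascade: a model wrapper, an answer scorer, a price, and a threshold."""
    name: str
    query_fn: Callable[[str], str]          # prompt -> answer (hypothetical API wrapper)
    score_fn: Callable[[str, str], float]   # (prompt, answer) -> confidence in [0, 1]
    cost_per_query: float                   # illustrative price per call, in dollars
    threshold: float                        # accept the answer if score >= threshold


def cascade_query(prompt: str, stages: List[CascadeStage]) -> Tuple[str, str, float]:
    """Query cheaper models first and escalate only when the scorer
    judges the current answer unreliable. Returns (answer, model_used, spend)."""
    spend = 0.0
    answer, used = "", ""
    for stage in stages:
        answer = stage.query_fn(prompt)
        spend += stage.cost_per_query
        used = stage.name
        if stage.score_fn(prompt, answer) >= stage.threshold:
            break  # answer accepted; costlier models are never called
    return answer, used, spend


if __name__ == "__main__":
    # Toy stand-ins for real LLM APIs and a learned scorer.
    cheap = CascadeStage("cheap-llm", lambda p: "draft answer",
                         lambda p, a: 0.4, cost_per_query=0.0002, threshold=0.9)
    strong = CascadeStage("gpt-4", lambda p: "careful answer",
                          lambda p, a: 0.95, cost_per_query=0.03, threshold=0.9)
    print(cascade_query("What is the capital of France?", [cheap, strong]))
```

The cost savings come from the fact that, under this scheme, most queries stop at a cheap model and only the queries the scorer flags as hard ever reach the most expensive model.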
