Universality and Limitations of Prompt Tuning

Despite the demonstrated empirical efficacy of prompt tuning for adapting a pretrained language model to a new task, the theoretical underpinnings of the difference between "tuning parameters prepended to the input" and "tuning the model weights themselves" remain limited. We thus take one of the first steps toward understanding the role of soft-prompt tuning for transformer-based architectures. Considering a general-purpose architecture, we analyze prompt tuning through the lens of both universal approximation and the limitations of finite-depth, fixed-weight pretrained transformers for continuous-valued functions. Our universality result guarantees the existence of a strong transformer with a prompt that approximates any sequence-to-sequence function in the set of Lipschitz functions. We then establish the limitations of prompt tuning for limited-depth transformers by constructing a set of datasets that cannot be memorized by a prompt of any length for a given single encoder layer. We also provide a lower bound on the required number of tunable prompt parameters and compare it with the number of parameters required for a low-rank update (based on LoRA) in a single-layer setting. Finally, we extend our analysis to multi-layer settings by providing sufficient conditions under which the transformer can at best learn datasets from invertible functions only. Our theoretical claims are corroborated by empirical results.
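To make the parameter-budget comparison concrete, the sketch below counts the tunable parameters of a soft prompt versus a LoRA-style low-rank update on a single weight matrix. The prompt length, rank, and embedding dimension used here are illustrative assumptions, not values taken from the paper.

```python
# Back-of-the-envelope comparison of tunable parameter counts for
# soft-prompt tuning versus a LoRA-style low-rank update.
# All dimensions below are illustrative assumptions, not values from the paper.

def prompt_tuning_params(num_prompt_tokens: int, embed_dim: int) -> int:
    """A soft prompt prepends `num_prompt_tokens` trainable vectors of size
    `embed_dim` to the input sequence; the pretrained weights stay frozen."""
    return num_prompt_tokens * embed_dim


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA update parameterizes the weight change as B @ A, with
    A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d = 768  # hypothetical embedding dimension
    print("prompt tuning:", prompt_tuning_params(num_prompt_tokens=20, embed_dim=d))  # 15360
    print("LoRA (rank 8):", lora_params(d_in=d, d_out=d, rank=8))                     # 12288
```

In this toy setting the two budgets are of the same order; the abstract's single-layer comparison concerns exactly these two kinds of tunable-parameter budgets.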

[1] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, arXiv.

[2] D. Schuurmans, et al. What learning algorithm is in-context learning? Investigations with linear models, 2022, ICLR.

[3] Sanjeev Arora, et al. A Kernel-Based View of Language Model Fine-Tuning, 2022, ICML.

[4] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.

[5] Sashank J. Reddi, et al. Robust Training of Neural Networks using Scale Invariant Architectures, 2022, ICML.

[6] Colin Wei, et al. Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers, 2021, NeurIPS.

[7] Yoav Goldberg, et al. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models, 2021, ACL.

[8] Yelong Shen, et al. LoRA: Low-Rank Adaptation of Large Language Models, 2021, ICLR.

[9] Sang Michael Xie, et al. Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning, 2021, NeurIPS.

[10] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.

[11] Kevin Scaman, et al. Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks, 2021, ICML.

[12] Andreas Loukas, et al. Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth, 2021, ICML.

[13] Samy Bengio, et al. Understanding deep learning (still) requires rethinking generalization, 2021, Commun. ACM.

[14] Armen Aghajanyan, et al. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, 2020, ACL.

[15] Andriy Mnih, et al. The Lipschitz Constant of Self-Attention, 2020, ICML.

[16] Joe Davison, et al. Compacter: Efficient Low-Rank Hypercomplex Adapter Layers, 2021, NeurIPS.

[17] P. Barceló, et al. Attention is Turing-Complete, 2021, J. Mach. Learn. Res.

[18] Percy Liang, et al. Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, ACL.

[19] Sashank J. Reddi, et al. Why are Adaptive Methods Good for Attention Models?, 2020, NeurIPS.

[20] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[21] Sashank J. Reddi, et al. Are Transformers universal approximators of sequence-to-sequence functions?, 2019, ICLR.

[22] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[23] David Duvenaud, et al. Invertible Residual Networks, 2018, ICML.

[24] Suvrit Sra, et al. Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity, 2018, NeurIPS.

[25] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[26] Matthias Hein, et al. Optimization Landscape and Expressivity of Deep CNNs, 2017, ICML.

[27] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[28] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[29] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[30] Philipp Koehn, et al. Findings of the 2014 Workshop on Statistical Machine Translation, 2014, WMT@ACL.

[31] Guang-Bin Huang, et al. Learning capability and storage capacity of two-hidden-layer feedforward networks, 2003, IEEE Trans. Neural Networks.

[32] Guang-Bin Huang, et al. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions, 1998, IEEE Trans. Neural Networks.

[33] Masami Yamasaki, et al. The Lower Bound of the Capacity for a Neural Network with Multiple Hidden Layers, 1993.

[34] Y. F. Huang, et al. Bounds on number of hidden neurons of multilayer perceptrons in classification and recognition, 1990, IEEE International Symposium on Circuits and Systems.