Benefits of Transformer: In-Context Learning in Linear Regression Tasks with Unstructured Data

In practice, it is observed that transformer-based models can learn concepts in context at inference time. While the existing literature, e.g., \citet{zhang2023trained,huang2023context}, provides theoretical explanations of this in-context learning ability, it assumes that the input $x_i$ and the output $y_i$ of each example are embedded in the same token (i.e., structured data). In practice, however, they are presented in two separate tokens (i.e., unstructured data \cite{wibisono2023role}). In this setting, this paper conducts experiments on linear regression tasks to study the benefits of the transformer architecture and provides corresponding theoretical intuitions explaining why transformers can learn from unstructured data. We identify the exact components of a transformer that facilitate in-context learning. In particular, we observe that (1) a transformer with two layers of softmax (self-)attention and a look-ahead attention mask can learn from the prompt when $y_i$ appears in the token immediately following $x_i$ for each example; (2) positional encoding further improves performance; and (3) multi-head attention with a high input embedding dimension yields better prediction performance than single-head attention.
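To make the structured/unstructured distinction concrete, the minimal sketch below builds both prompt layouts for a toy linear regression task; it is an illustrative assumption, not the paper's code, and the function name `make_prompts` and the zero-padding of $y_i$ to the token width are hypothetical choices. In the structured layout each example occupies one token stacking $(x_i, y_i)$; in the unstructured layout $x_i$ and $y_i$ sit in two consecutive tokens, and a look-ahead (causal) mask restricts each position to attend only to earlier positions.

```python
import numpy as np

def make_prompts(n_examples=8, d=4, seed=0):
    """Build toy linear-regression prompts in both token layouts."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)                   # task weight vector, fixed per prompt
    X = rng.standard_normal((n_examples, d))     # in-context inputs x_i
    y = X @ w                                    # labels y_i = <w, x_i>

    # Structured data: x_i and y_i embedded in the same token -> shape (n, d + 1).
    structured = np.concatenate([X, y[:, None]], axis=1)

    # Unstructured data: x_i and y_i occupy two consecutive tokens -> shape (2n, d);
    # each y_i is zero-padded to the token width (an illustrative choice).
    y_tokens = np.zeros((n_examples, d))
    y_tokens[:, 0] = y
    unstructured = np.empty((2 * n_examples, d))
    unstructured[0::2] = X         # even positions hold x_i
    unstructured[1::2] = y_tokens  # odd positions hold y_i
    return structured, unstructured, w

structured, unstructured, w = make_prompts()
print(structured.shape, unstructured.shape)      # (8, 5) (16, 4)

# Look-ahead (causal) attention mask over the unstructured prompt:
# position t may only attend to positions <= t.
seq_len = unstructured.shape[0]
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

In the actual experiments the query $x_{\text{query}}$ would be appended as the final token and the model asked to predict the corresponding $y$; the sketch above only illustrates the two prompt layouts and the masking convention.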

[1] S. Sra et al. Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context, 2023, arXiv.

[2] Christos Thrampoulidis et al. On the Optimization and Generalization of Multi-head Attention, 2023, arXiv.

[3] Yu Huang et al. In-Context Convergence of Transformers, 2023, arXiv.

[4] Timothy Chu et al. Fine-tune Language Models to Approximate Unbiased In-context Learning, 2023, arXiv.

[5] S. Sra et al. Linear attention is (maybe) all you need (to understand transformer optimization), 2023, ICLR.

[6] Yiqi Wang et al. LinRec: Linear Attention Mechanism for Long-term Sequential Recommender Systems, 2023, SIGIR.

[7] S. Mahadevan et al. Zero-th Order Algorithm for Softmax Attention Optimization, 2023, arXiv.

[8] Yeqi Gao et al. In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick, 2023, arXiv.

[9] P. Bartlett et al. Trained Transformers Learn Linear Models In-Context, 2023, arXiv.

[10] Song Mei et al. Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection, 2023, arXiv.

[11] A. Rawat et al. On the Role of Attention in Prompt-tuning, 2023, ICML.

[12] Renjie Liao et al. Memorization Capacity of Multi-Head Attention in Transformers, 2023, arXiv.

[13] S. Sra et al. Transformers learn to implement preconditioned gradient descent for in-context learning, 2023, NeurIPS.

[14] Jason D. Lee et al. Reward Collapse in Aligning Large Language Models, 2023, arXiv.

[15] Shuai Li et al. The Closeness of In-Context Learning and Weight Shifting for Softmax Regression, 2023, arXiv.

[16] M. Wang et al. A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity, 2023, ICLR.

[17] Dimitris Papailiopoulos et al. Transformers as Algorithms: Generalization and Stability in In-context Learning, 2023, ICML.

[18] Li Dong et al. Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers, 2022, arXiv:2212.10559.

[19] A. Zhmoginov et al. Transformers learn in-context by gradient descent, 2022, ICML.

[20] D. Schuurmans et al. What learning algorithm is in-context learning? Investigations with linear models, 2022, ICLR.

[21] Michael E. Sander et al. Vision Transformers provably learn spatial structure, 2022, NeurIPS.

[22] Percy Liang et al. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes, 2022, NeurIPS.

[23] Lingpeng Kong et al. Linear Complexity Randomized Self-attention Mechanism, 2022, ICML.

[24] I. Assent et al. Generalized Classification of Satellite Image Time Series with Thermal Positional Encoding, 2022, CVPR Workshops (CVPRW).

[25] Ding-Xuan Zhou et al. Attention Enables Zero Approximation Error, 2022, arXiv.

[26] Junjie Yan et al. cosFormer: Rethinking Softmax in Attention, 2022, ICLR.

[27] Rickard Brüel Gabrielsson et al. Rewiring with Positional Encodings for Graph Neural Networks, 2022, arXiv.

[28] A. Schwing et al. Masked-attention Mask Transformer for Universal Image Segmentation, 2022, CVPR.

[29] Xuanjing Huang et al. Mask Attention Networks: Rethinking and Strengthen Transformer, 2021, NAACL.

[30] Changyou Chen et al. Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference, 2020, EMNLP.

[31] Rui Li et al. Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation, 2020, arXiv.

[32] Qi Tian et al. Polar Relative Positional Encoding for Video-Language Segmentation, 2020, IJCAI.

[33] Nikolaos Pappas et al. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, 2020, ICML.

[34] Xiu-Shen Wei et al. Bi-Modal Progressive Mask Attention for Fine-Grained Recognition, 2020, IEEE Transactions on Image Processing.

[35] Alec Radford et al. Scaling Laws for Neural Language Models, 2020, arXiv.

[36] Chris Quirk et al. Novel positional encodings to enable tree-based transformers, 2019, NeurIPS.

[37] Fahad Shahbaz Khan et al. Mask-Guided Attention Network for Occluded Pedestrian Detection, 2019, ICCV.

[38] Betty van Aken et al. How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations, 2019, CIKM.

[39] Shuai Yi et al. Efficient Attention: Attention with Linear Complexities, 2021, WACV.

[40] Iasonas Kokkinos et al. Segmentation-Aware Convolutional Networks Using Local Attention Masks, 2017, ICCV.

[41] Lukasz Kaiser et al. Attention is All you Need, 2017, NeurIPS.

[42] Benoît Crabbé et al. How Many Layers and Why? An Analysis of the Model Depth in Transformers, 2021, ACL.

[43] Aidong Zhang et al. A Survey on Context Learning, 2017, IEEE Transactions on Knowledge and Data Engineering.