Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
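
To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of a layer that applies standard multi-head self-attention along the dataset axis, so the representation of each datapoint can depend on every other datapoint in the input set. The module name, embedding dimensions, and usage are illustrative assumptions built on a PyTorch-style interface.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: multi-head self-attention applied across
# datapoints, so each embedded row attends to all other rows in the dataset.
class AttentionBetweenDatapoints(nn.Module):
    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_datapoints, embed_dim) -- one embedded row per datapoint.
        # Treat the entire dataset as a single "sequence" of length n.
        h = x.unsqueeze(0)                    # (1, n, embed_dim)
        out, _ = self.attn(h, h, h)           # each datapoint attends to all others
        return self.norm(x + out.squeeze(0))  # residual + layer norm, (n, embed_dim)

# Hypothetical usage: a dataset of 8 datapoints, each embedded to 64 dimensions.
if __name__ == "__main__":
    dataset = torch.randn(8, 64)
    layer = AttentionBetweenDatapoints()
    print(layer(dataset).shape)  # torch.Size([8, 64])
```

In a full model, such a layer would be interleaved with per-datapoint (feature-wise) processing and trained end-to-end, letting the network learn when and how to use other datapoints for prediction.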
