Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
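
To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of a layer that applies standard multi-head self-attention along the dataset axis, so the representation of each datapoint can depend on every other datapoint in the input set. The module name, embedding dimensions, and usage are illustrative assumptions built on a PyTorch-style interface.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: multi-head self-attention applied across
# datapoints, so each embedded row attends to all other rows in the dataset.
class AttentionBetweenDatapoints(nn.Module):
    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_datapoints, embed_dim) -- one embedded row per datapoint.
        # Treat the entire dataset as a single "sequence" of length n.
        h = x.unsqueeze(0)                    # (1, n, embed_dim)
        out, _ = self.attn(h, h, h)           # each datapoint attends to all others
        return self.norm(x + out.squeeze(0))  # residual + layer norm, (n, embed_dim)

# Hypothetical usage: a dataset of 8 datapoints, each embedded to 64 dimensions.
if __name__ == "__main__":
    dataset = torch.randn(8, 64)
    layer = AttentionBetweenDatapoints()
    print(layer(dataset).shape)  # torch.Size([8, 64])
```

In a full model, such a layer would be interleaved with per-datapoint (feature-wise) processing and trained end-to-end, letting the network learn when and how to use other datapoints for prediction.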
