Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures

This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a ‘reasoning’ function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that when learning logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution-shift setting, when the data withholding corresponds to freezing a single feature (referred to as the canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures, and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn yields the Boolean influence as the generalization error under quadratic loss.
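
The Boolean influence appearing in this characterization is a standard quantity from the analysis of Boolean functions: the influence of coordinate i is the probability, over a uniformly random input, that flipping the i-th bit changes the value of the target. The sketch below is a minimal illustration, assuming a hypothetical PVR-style target in which pointer bits select a parity window over the value bits; the parameters P, W, N_VALUES and the helpers pvr and boolean_influence are illustrative choices, not the exact benchmark configuration of [ZRKB21].

    import itertools
    import math

    # Hypothetical PVR-style Boolean target: the first P bits encode a pointer
    # into the N_VALUES remaining "value" bits, and the label is the parity of
    # a window of W value bits starting at the pointed-to position.
    P, W, N_VALUES = 2, 2, 6      # pointer bits, window size, value bits (illustrative)
    N = P + N_VALUES              # total input length

    def pvr(x):
        """x is a tuple in {-1, +1}^N; returns a label in {-1, +1}."""
        # Decode the pointer from the first P bits (treating +1 as bit 1).
        ptr = sum((1 << j) for j, b in enumerate(x[:P]) if b == 1) % N_VALUES
        # Parity (product) of the selected window, wrapping around the value bits.
        window = [x[P + (ptr + j) % N_VALUES] for j in range(W)]
        return math.prod(window)

    def boolean_influence(f, n, i):
        """Inf_i(f) = Pr_x[f(x) != f(x with coordinate i flipped)], x uniform on {-1,+1}^n."""
        count = 0
        for x in itertools.product([-1, 1], repeat=n):
            flipped = list(x)
            flipped[i] = -flipped[i]
            count += f(x) != f(tuple(flipped))
        return count / 2 ** n

    if __name__ == "__main__":
        for i in range(N):
            kind = "pointer" if i < P else "value"
            print(f"coordinate {i:2d} ({kind}): influence = {boolean_influence(pvr, N, i):.3f}")

Under the canonical-holdout reading above, the influence of the frozen coordinate is the quantity that the generalization error of the GD-trained model is claimed to track, so a table of per-coordinate influences (pointer bits versus value bits) gives the predicted error profile under the corresponding distribution shift.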

[1] Jan Hązła et al. An initial alignment between neural network and target is needed for gradient descent to learn, 2022, ICML.

[2] E. Abbe et al. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks, 2022, COLT.

[3] Tatsunori B. Hashimoto et al. Extending the WILDS Benchmark for Unsupervised Adaptation, 2021, ICLR.

[4] Ali Taylan Cemgil et al. A Fine-Grained Analysis on Distribution Shift, 2021, ICLR.

[5] Nathan Srebro et al. On Margin Maximization in Linear and ReLU Networks, 2021, NeurIPS.

[6] Hunter M. Nisonoff et al. Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions, 2021, Nature Communications.

[7] Guy Bresler et al. The staircase property: How hierarchical structure can guide deep learning, 2021, NeurIPS.

[8] Nathan Srebro et al. On the Power of Differentiable Learning versus PAC and SQ Learning, 2021, NeurIPS.

[9] Samy Bengio et al. Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization, 2021, ArXiv.

[10] Yair Carmon et al. Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization, 2021, ICML.

[11] Nicolas Flammarion et al. Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity, 2021, NeurIPS.

[12] A. Dosovitskiy et al. MLP-Mixer: An all-MLP Architecture for Vision, 2021, NeurIPS.

[13] S. Gelly et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[14] Matus Telgarsky et al. Characterizing the implicit bias via a primal-dual analysis, 2019, ALT.

[15] Aleksei Udovenko et al. MILP modeling of Boolean functions by minimum number of inequalities, 2021, IACR Cryptol. ePrint Arch.

[16] Nathan Srebro et al. Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy, 2020, NeurIPS.

[17] Nadav Cohen et al. Implicit Regularization in Deep Learning May Not Be Explainable by Norms, 2020, NeurIPS.

[18] Francis Bach et al. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, 2020, COLT.

[19] Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[20] Kaifeng Lyu et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[21] Zheng Ma et al. Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019, Communications in Computational Physics.

[22] Emmanuel Abbe et al. On the universality of deep learning, 2020, NeurIPS.

[23] Natalia Gimelshein et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[24] Sanjeev Arora et al. Implicit Regularization in Deep Matrix Factorization, 2019, NeurIPS.

[25] Abhimanyu Das et al. On the Learnability of Deep Random Networks, 2019, ArXiv.

[26] Ramji Venkataramanan et al. Boolean Functions with Biased Inputs: Approximation and Noise Sensitivity, 2019, IEEE International Symposium on Information Theory (ISIT).

[27] Zhi-Qin John Xu et al. Training behavior of deep neural network in frequency domain, 2018, ICONIP.

[28] Yoshua Bengio et al. On the Spectral Bias of Neural Networks, 2018, ICML.

[29] Susmit Jha et al. Explaining AI Decisions Using Efficient Methods for Learning Sparse Boolean Formulae, 2018, Journal of Automated Reasoning.

[30] Nathan Srebro et al. Implicit Bias of Gradient Descent on Linear Convolutional Networks, 2018, NeurIPS.

[31] Nathan Srebro et al. Characterizing Implicit Bias in Terms of Optimization Geometry, 2018, ICML.

[32] Yi Zhang et al. Stronger generalization bounds for deep nets via a compression approach, 2018, ICML.

[33] Nathan Srebro et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[34] Jian Shen et al. Wasserstein Distance Guided Representation Learning for Domain Adaptation, 2017, AAAI.

[35] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.

[36] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[37] Ryota Tomioka et al. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, 2014, ICLR.

[38] T. Sanders et al. Analysis of Boolean Functions, 2012, ArXiv.

[39] Koby Crammer et al. A theory of learning from different domains, 2010, Machine Learning.

[40] Neil D. Lawrence et al. Dataset Shift in Machine Learning, 2009.

[41] Yishay Mansour et al. Weakly learning DNF and characterizing statistical query learning using Fourier analysis, 1994, STOC '94.

[42] Michael Kearns et al. Efficient noise-tolerant learning from statistical queries, 1993, STOC.