Learning explanations that are hard to vary

In this paper, we investigate the principle that "good explanations are hard to vary" in the context of deep learning. We show that averaging gradients across examples -- akin to a logical OR of patterns -- can favor memorization and "patchwork" solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND that focuses on invariances and prevents memorization on a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers.
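To make the contrast between the OR-like and AND-like update rules concrete, below is a minimal sketch of one way a logical-AND combination of per-example gradients could be realized: average the gradients as usual, but zero out the components whose signs do not agree across examples. The function name, the `agreement_threshold` parameter, and the use of PyTorch are assumptions made for illustration, not the paper's exact implementation.

```python
import torch

def and_mask_gradients(per_example_grads, agreement_threshold=1.0):
    """Combine per-example gradients with an AND-like rule.

    Instead of plain averaging (an OR of patterns), keep only the gradient
    components whose signs agree across (a fraction of) the examples, and
    zero out the rest.

    per_example_grads: tensor of shape (num_examples, num_params)
    agreement_threshold: fraction of examples that must share a sign for a
        component to survive (1.0 corresponds to a strict AND).
    """
    signs = torch.sign(per_example_grads)          # (n, p) entries in {-1, 0, +1}
    mean_sign = signs.mean(dim=0)                  # per-component agreement score
    # A component is kept only if enough examples agree on its sign.
    mask = (mean_sign.abs() >= agreement_threshold).float()
    avg_grad = per_example_grads.mean(dim=0)
    return mask * avg_grad


# Toy usage: two "environments" agree on the first gradient component but
# disagree on the second, so only the first component survives the mask.
grads = torch.tensor([[0.5,  1.0],
                      [0.3, -1.0]])
print(and_mask_gradients(grads))  # tensor([0.4, 0.0])
```

Averaging alone would have produced a non-zero update in both components; the sign-agreement mask suppresses the direction on which the examples conflict, which is the intuition behind favoring invariances over patchwork solutions.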
