Learning explanations that are hard to vary

In this paper, we investigate the principle that "good explanations are hard to vary" in the context of deep learning. We show that averaging gradients across examples -- akin to a logical OR of patterns -- can favor memorization and "patchwork" solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND that focuses on invariances and prevents memorization on a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers.
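To make the contrast between the OR-like and AND-like update rules concrete, below is a minimal sketch of one way a logical-AND combination of per-example gradients could be realized: average the gradients as usual, but zero out the components whose signs do not agree across examples. The function name, the `agreement_threshold` parameter, and the use of PyTorch are assumptions made for illustration, not the paper's exact implementation.

```python
import torch

def and_mask_gradients(per_example_grads, agreement_threshold=1.0):
    """Combine per-example gradients with an AND-like rule.

    Instead of plain averaging (an OR of patterns), keep only the gradient
    components whose signs agree across (a fraction of) the examples, and
    zero out the rest.

    per_example_grads: tensor of shape (num_examples, num_params)
    agreement_threshold: fraction of examples that must share a sign for a
        component to survive (1.0 corresponds to a strict AND).
    """
    signs = torch.sign(per_example_grads)          # (n, p) entries in {-1, 0, +1}
    mean_sign = signs.mean(dim=0)                  # per-component agreement score
    # A component is kept only if enough examples agree on its sign.
    mask = (mean_sign.abs() >= agreement_threshold).float()
    avg_grad = per_example_grads.mean(dim=0)
    return mask * avg_grad


# Toy usage: two "environments" agree on the first gradient component but
# disagree on the second, so only the first component survives the mask.
grads = torch.tensor([[0.5,  1.0],
                      [0.3, -1.0]])
print(and_mask_gradients(grads))  # tensor([0.4, 0.0])
```

Averaging alone would have produced a non-zero update in both components; the sign-agreement mask suppresses the direction on which the examples conflict, which is the intuition behind favoring invariances over patchwork solutions.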
