Learning explanations that are hard to vary

In this paper, we investigate the principle that 'good explanations are hard to vary' in the context of deep learning. We show that averaging gradients across examples -- akin to a logical OR of patterns -- can favor memorization and 'patchwork' solutions that sew together different strategies, instead of identifying invariances. To study this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND that focuses on invariances and prevents memorization on a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers.
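
The contrast between OR-like and AND-like gradient aggregation can be made concrete with a short sketch. The snippet below is an illustration only, assuming the AND is realized as an element-wise sign-agreement mask over per-example gradients; the function name and_mask_update and the agreement_threshold parameter are placeholders introduced here, not names taken from the paper.

import torch

def and_mask_update(per_example_grads, agreement_threshold=1.0):
    # per_example_grads: tensor of shape (n_examples, n_params), one flattened
    # gradient per example. The arithmetic mean (the usual SGD signal) acts like
    # an OR: a strong pattern in any single example can dominate the update.
    # The mask below keeps only components whose sign agrees across (a fraction
    # of) examples and zeroes the rest -- one way to realize an AND over
    # per-example learning signals.
    signs = torch.sign(per_example_grads)                     # (n, p), values in {-1, 0, +1}
    mean_sign = signs.mean(dim=0)                             # per-component agreement in [-1, 1]
    mask = (mean_sign.abs() >= agreement_threshold).float()   # 1 where examples agree in sign
    return mask * per_example_grads.mean(dim=0)               # masked arithmetic-mean gradient

# Toy usage: per-example gradients of a squared loss for a linear model.
torch.manual_seed(0)
x, y = torch.randn(8, 3), torch.randn(8)
w = torch.zeros(3, requires_grad=True)
grads = torch.stack([torch.autograd.grad((x[i] @ w - y[i]) ** 2, w)[0] for i in range(8)])
with torch.no_grad():
    w -= 0.1 * and_mask_update(grads, agreement_threshold=0.75)

With agreement_threshold close to 1, only directions shared by essentially all examples are followed; lowering it interpolates back toward ordinary gradient averaging.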
