Noise Regularization for Conditional Density Estimation

Modelling statistical relationships beyond the conditional mean is crucial in many settings. Conditional density estimation (CDE) aims to learn the full conditional probability density from data. Though highly expressive, neural network-based CDE models can suffer from severe overfitting when trained with the maximum likelihood objective. Due to the inherent structure of such models, classical regularization approaches in the parameter space are rendered ineffective. To address this issue, we develop a model-agnostic noise regularization method for CDE that adds random perturbations to the data during training. We demonstrate that the proposed approach corresponds to a smoothness regularization and prove its asymptotic consistency. In our experiments, noise regularization significantly and consistently outperforms other regularization methods across seven data sets and three CDE models. Its effectiveness makes neural network-based CDE preferable to previous non- and semi-parametric approaches, even when training data is scarce.
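
The core idea, perturbing the training pairs with fresh random noise before every maximum-likelihood gradient step, can be illustrated with a minimal sketch. This is not the paper's code: the simple Gaussian conditional model, the noise bandwidths h_x and h_y, and all hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with heteroscedastic noise.
n = 500
x = rng.uniform(-2.0, 2.0, size=n)
y = 0.5 * x + rng.normal(scale=0.2 + 0.2 * np.abs(x))

# Simple parametric CDE model: p(y|x) = N(y; w0 + w1*x, exp(v0 + v1*x)^2),
# fitted by maximum likelihood with plain gradient descent.
params = np.zeros(4)  # [w0, w1, v0, v1]

def neg_log_likelihood_grad(params, x, y):
    """Gradient of the average negative log-likelihood w.r.t. the parameters."""
    w0, w1, v0, v1 = params
    mu = w0 + w1 * x
    log_sigma = v0 + v1 * x
    sigma2 = np.exp(2.0 * log_sigma)
    r = y - mu
    d_mu = -r / sigma2                 # d NLL / d mu
    d_log_sigma = 1.0 - r ** 2 / sigma2  # d NLL / d log(sigma)
    return np.array([
        d_mu.mean(),
        (d_mu * x).mean(),
        d_log_sigma.mean(),
        (d_log_sigma * x).mean(),
    ])

# Noise regularization: perturb every (x, y) pair with fresh Gaussian noise
# of standard deviation h_x / h_y before each maximum-likelihood gradient step.
h_x, h_y = 0.2, 0.2
lr = 0.05
for step in range(2000):
    x_tilde = x + rng.normal(scale=h_x, size=x.shape)
    y_tilde = y + rng.normal(scale=h_y, size=y.shape)
    params -= lr * neg_log_likelihood_grad(params, x_tilde, y_tilde)

print("fitted parameters [w0, w1, v0, v1]:", params)
```

Note that the noise is applied to both the conditioning variable x and the target y, which distinguishes the scheme from ordinary input-noise data augmentation; the standard deviations h_x and h_y act as smoothing bandwidths, consistent with the interpretation of the method as a smoothness regularizer.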
