Estimation of Non-Normalized Statistical Models by Score Matching

One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov chain Monte Carlo methods or to approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While estimating the gradient of the log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of certain derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data.
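The objective described above can be illustrated on the simplest possible case. The following is a minimal sketch (not the paper's experiments) for a one-dimensional Gaussian model, where the model score is psi(x) = d/dx log p(x; mu, var) = -(x - mu)/var. The empirical objective is the sample average of psi'(x) + (1/2) psi(x)^2, which never requires the normalization constant; for this model its minimizer coincides with the sample mean and sample variance. The data, grid, and function names below are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=2000)  # synthetic observed data

def score_matching_objective(mu, var, data):
    """Empirical score matching objective for a 1-D Gaussian model.

    psi(x)  = -(x - mu) / var   (derivative of the model log-density)
    psi'(x) = -1 / var
    Returns the sample average of psi'(x) + 0.5 * psi(x)**2; the
    normalization constant of the density is never needed.
    """
    psi = -(data - mu) / var
    dpsi = -1.0 / var
    return np.mean(dpsi + 0.5 * psi ** 2)

# Coarse grid search over (mu, var); the minimizer should land near
# the sample mean and sample variance of the data.
mus = np.linspace(0.0, 4.0, 101)
vars_ = np.linspace(0.5, 4.0, 101)
J = np.array([[score_matching_objective(m, v, x) for v in vars_] for m in mus])
i, j = np.unravel_index(np.argmin(J), J.shape)
print(mus[i], vars_[j])
```

In this toy case the printed estimates agree with the sample moments up to the grid resolution, matching the closed-form minimizer one obtains by differentiating the objective with respect to mu and var.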