Estimation of mixture models

We analyze mixture density approximation and estimation. A convex set of density functions is formed by taking the convex hull of a parametric family, e.g., mixtures of the Gaussian location family. A sequence of finite mixture densities is constructed to provide a parsimonious approximation to the target density. If the target density lies in the convex hull, we show that the approximation error goes to zero at the rate 1/k, where k is the number of components in the approximation; if it lies outside the convex hull, the approximation error equals the best achievable error plus a term that goes to zero at the rate 1/k. A greedy algorithm that introduces one component at each step is shown to achieve this rate. Similarly, a greedy estimation algorithm, which fits one mixture component at a time, is provided to find such an approximation for data drawn from an arbitrary density. We prove that this algorithm achieves a likelihood nearly as good as that of the maximum likelihood estimate (MLE) over the whole convex hull, with the difference bounded by O(1/k), where k is the number of components in the estimate. The risk of such an estimator is shown to be bounded by the sum of an approximation error and an estimation error, and both error terms are identified explicitly. An optimal choice of k can then be derived by minimizing the risk bound: playing a role similar to that of the bandwidth in nonparametric density estimation, k controls the two error terms in opposite directions, since a large k reduces the approximation error but increases the estimation error. A minimum description length (MDL) principle is derived to provide an estimator of k, and the estimated k is shown to achieve the risk bound as if the best k were known in advance. A new information projection theory is developed to expand the approximating class to include its information closure. We prove the existence and uniqueness of an $f^*$ in the closure of the convex hull $C$ (in a sense we identify) such that $D(f \| f^*) = \inf_{g \in C} D(f \| g)$, where $D(f \| g)$ denotes the Kullback-Leibler divergence; moreover, $\log f_k \to \log f^*$ in $L_1(f)$ for any sequence $f_k$ in $C$ with $D(f \| f_k) \to \inf_{g \in C} D(f \| g)$. Other characterizing properties of $f^*$ are also given.
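As a concrete illustration of the greedy procedure, here is a minimal Python sketch. It is an illustrative toy, not the paper's algorithm verbatim: it fits a Gaussian location mixture by adding one component per step, choosing the new location theta and mixing weight alpha by grid search to maximize the likelihood of f_k = (1 - alpha) f_{k-1} + alpha N(theta, sigma^2), then selects k by minimizing an MDL/BIC-style penalized likelihood. Restricting candidate locations to the data points, fixing sigma, discretizing alpha, and the k log n penalty are all simplifying assumptions.

import numpy as np

def greedy_gaussian_mixture(x, k_max, sigma=1.0, alphas=np.linspace(0.05, 0.95, 19)):
    """Greedy likelihood maximization over the convex hull of Gaussian
    location densities: f_k = (1 - alpha) f_{k-1} + alpha N(theta, sigma^2).
    Candidate locations theta are restricted to the data points (a
    simplification); returns the component list and log-likelihood per k."""
    # cand[i, j] = density of N(x_i, sigma^2) evaluated at x_j, shape (n, n).
    cand = np.exp(-0.5 * ((x[None, :] - x[:, None]) / sigma) ** 2) \
           / (sigma * np.sqrt(2.0 * np.pi))
    # k = 1: pick the single component with the highest log-likelihood.
    best = int(np.argmax(np.log(cand).sum(axis=1)))
    f = cand[best].copy()                  # current density at the data points
    components = [(1.0, x[best])]          # list of (weight, location) pairs
    log_liks = [np.log(f).sum()]
    for _ in range(1, k_max):
        # Log-likelihood of (1 - a) f + a phi_theta over the (a, theta) grid.
        mix = (1 - alphas[:, None, None]) * f[None, None, :] \
              + alphas[:, None, None] * cand[None, :, :]
        ll = np.log(mix).sum(axis=2)       # shape (n_alpha, n_candidates)
        i, j = np.unravel_index(np.argmax(ll), ll.shape)
        a = alphas[i]
        f = (1 - a) * f + a * cand[j]      # mix in the new component
        components = [(w * (1 - a), th) for (w, th) in components] + [(a, x[j])]
        log_liks.append(np.log(f).sum())
    return components, np.array(log_liks)

# Usage: the MDL-style criterion trades the two error terms off through k.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])
components, ll = greedy_gaussian_mixture(x, k_max=6)
ks = np.arange(1, 7)
mdl = -ll + ks * np.log(len(x))   # illustrative penalty, roughly (2k/2) log n
k_hat = int(ks[np.argmin(mdl)])   # typically 2 for this two-component sample

In practice one would refine each new (alpha, theta) with a few EM steps rather than relying on a pure grid search; the O(1/k) guarantee only requires that each greedy step nearly maximize the one-component improvement.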
