Convergence Rates for Gaussian Mixtures of Experts

We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish convergence rates for maximum likelihood estimation (MLE) in these models. Our proof technique is based on a novel notion of \emph{algebraic independence} of the expert functions. Drawing on optimal transport theory, we establish a connection between algebraic independence and a certain class of partial differential equations (PDEs). Exploiting this connection allows us to derive convergence rates and minimax lower bounds for parameter estimation.
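
For concreteness, the display below is a minimal sketch of the model class under study; the notation (gating weights $\pi_i$, expert mean functions $h(x, \theta_i)$, variances $\sigma_i^2$) is illustrative and assumed here rather than quoted from the paper:

% Sketch under assumed notation: a k-component Gaussian mixture of experts
% whose gating weights \pi_i do not depend on the covariate x.
\[
  p_G(y \mid x) \;=\; \sum_{i=1}^{k} \pi_i \,
  \phi\bigl(y \mid h(x, \theta_i), \sigma_i^2\bigr),
  \qquad \pi_i \ge 0, \quad \sum_{i=1}^{k} \pi_i = 1,
\]

where $\phi(\cdot \mid \mu, \sigma^2)$ denotes the Gaussian density and the mixing measure $G = \sum_{i=1}^{k} \pi_i \, \delta_{(\theta_i, \sigma_i)}$ collects the parameters. "Over-specified" means the fitted number of components $k$ exceeds the true one; in the optimal-transport approach, the estimation error of the MLE $\widehat{G}_n$ is then measured by a Wasserstein distance $W_r(\widehat{G}_n, G_0)$ to the true mixing measure $G_0$.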
