A non-asymptotic model selection in block-diagonal mixture of polynomial experts models

Model selection via penalized-likelihood criteria is a standard task in many statistical inference and machine learning problems. It has led to criteria with asymptotic consistency guarantees and, more recently, to a growing emphasis on non-asymptotic criteria. We focus on modeling non-linear relationships in regression data with potential hidden graph-structured interactions between the high-dimensional predictors, within the mixture of experts modeling framework. To deal with such a complex situation, we investigate a block-diagonal localized mixture of polynomial experts (BLoMPE) regression model, constructed upon an inverse regression framework and block-diagonal structures of the Gaussian expert covariance matrices. We introduce a penalized maximum likelihood selection criterion to estimate the unknown conditional density of the regression model. This criterion allows us to handle the challenging problem of inferring the number of mixture components, the degree of the polynomial mean functions, and the hidden block-diagonal structures of the covariance matrices, which reduces the number of parameters to be estimated and leads to a trade-off between complexity and sparsity in the model. In particular, we provide a strong theoretical guarantee: a finite-sample oracle inequality, satisfied by the penalized maximum likelihood estimator with a Jensen-Kullback-Leibler type loss, that supports the introduced non-asymptotic model selection criterion. The penalty shape of this criterion depends on the complexity of the considered random subcollection of BLoMPE models, including the relevant graph structures, the degree of the polynomial mean functions, and the number of mixture components.
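
To make the framework concrete, the displays below sketch, in generic notation of our own rather than the paper's exact parameterization, the conditional density of a Gaussian-gated (localized) mixture of polynomial experts and the generic shape of a finite-sample guarantee of the kind announced above. Here $K$ denotes the number of mixture components, $p$ the degree of the polynomial mean functions, $\phi(\cdot;\mu,\Sigma)$ a Gaussian density, and each expert covariance matrix $\Sigma_k$ is constrained to be block diagonal:

\[
s_{\psi}(y \mid x) \;=\; \sum_{k=1}^{K}
\underbrace{\frac{\pi_k\, \phi(x;\, c_k, \Gamma_k)}{\sum_{l=1}^{K} \pi_l\, \phi(x;\, c_l, \Gamma_l)}}_{\text{Gaussian gating weight}}
\; \phi\bigl(y;\, \upsilon_k(x),\, \Sigma_k\bigr),
\qquad \upsilon_k \ \text{a polynomial of degree at most } p .
\]

The block-diagonal constraint on each $\Sigma_k$ encodes a graph of conditional independencies and reduces the number of free covariance parameters. Schematically, and only as an indicative template rather than a restatement of the paper's theorem, the oracle inequality supporting the penalized criterion takes a form such as

\[
\mathbb{E}\Bigl[\mathrm{JKL}_{\rho}\bigl(s_0,\, \widehat{s}_{\widehat{m}}\bigr)\Bigr]
\;\le\;
C \inf_{m \in \mathcal{M}} \Bigl( \inf_{s_m \in S_m} \mathrm{KL}\bigl(s_0, s_m\bigr) + \frac{\mathrm{pen}(m)}{n} \Bigr) \;+\; \frac{\kappa}{n},
\]

where $s_0$ is the true conditional density, $S_m$ the BLoMPE model indexed by $m$ in the collection $\mathcal{M}$, $\widehat{s}_{\widehat{m}}$ the penalized maximum likelihood estimator, and the constants $C$, $\kappa$ and the penalty $\mathrm{pen}(m)$ depend on the complexity of the considered random subcollection of models.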
