Non-Normal Mixtures of Experts

Mixture of Experts (MoE) is a popular framework for modeling heterogeneity in data for regression, classification and clustering. For continuous data, which we consider here in the context of regression and cluster analysis, MoE models usually use normal experts, that is, expert components following the Gaussian distribution. However, for data containing one or more groups of observations with asymmetric behavior, heavy tails or atypical observations, normal experts may be unsuitable and can unduly affect the fit of the MoE model. In this paper, we introduce new non-normal mixture of experts (NNMoE) models that can handle possibly skewed and heavy-tailed data as well as outliers. The proposed models are the skew-normal MoE, the robust $t$ MoE and the skew $t$ MoE, named SNMoE, TMoE and STMoE, respectively. We develop dedicated expectation-maximization (EM) and expectation conditional maximization (ECM) algorithms to estimate the parameters of the proposed models by monotonically maximizing the observed-data log-likelihood. We describe how the presented models can be used in prediction and in model-based clustering of regression data. Numerical experiments carried out on simulated data show the effectiveness and the robustness of the proposed models in terms of modeling non-linear regression functions as well as in model-based clustering. Then, to show their usefulness in practical applications, the proposed models are applied to real-world tone perception data for musical data analysis, and to temperature anomaly data for the analysis of climate change.
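As a rough illustration of the kind of model discussed above, the following sketch evaluates the conditional density of a univariate mixture of experts whose experts follow a $t$ distribution. It is only a sketch under stated assumptions: the softmax gating with linear scores, the linear expert means, the function names and all parameter values are hypothetical choices made for illustration, and the EM/ECM parameter estimation is not shown.

```python
import math

def gating_probs(x, alphas):
    """Softmax gating probabilities; alphas holds one (intercept, slope) per expert.

    Assumption: linear gating scores in a scalar covariate x.
    """
    scores = [a0 + a1 * x for a0, a1 in alphas]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def student_t_pdf(y, mu, sigma, nu):
    """Density of a location-scale Student t with nu degrees of freedom."""
    z = (y - mu) / sigma
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(z * z / nu))

def t_moe_density(y, x, alphas, experts):
    """Conditional density f(y | x) = sum_k pi_k(x) * t(y; mu_k(x), sigma_k, nu_k).

    experts holds one (b0, b1, sigma, nu) per expert, with mean b0 + b1 * x
    (a hypothetical parameterization for this sketch).
    """
    pis = gating_probs(x, alphas)
    return sum(p * student_t_pdf(y, b0 + b1 * x, sigma, nu)
               for p, (b0, b1, sigma, nu) in zip(pis, experts))

# Hypothetical two-expert example; the values are illustrative, not fitted.
alphas = [(0.0, 2.0), (0.0, 0.0)]
experts = [(1.0, 0.5, 0.3, 3.0), (-1.0, -0.5, 0.4, 5.0)]
density_at_point = t_moe_density(0.2, 0.5, alphas, experts)
```

Replacing `student_t_pdf` with a Gaussian density would give the usual normal MoE; the heavier tails of the $t$ density are what make the experts less sensitive to atypical observations.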
