Pursuing sources of heterogeneity in modeling clustered population

Researchers often have to deal with heterogeneous populations with mixed regression relationships, increasingly so in the era of data explosion. In such problems, when there are many candidate predictors, it is of interest not only to identify the predictors that are associated with the outcome, but also to distinguish the true sources of heterogeneity, i.e., to identify the predictors that have different effects among the clusters and thus are the true contributors to the formation of the clusters. We clarify the concept of a source of heterogeneity in a way that accounts for potential scale differences among the clusters, and propose a regularized finite mixture effects regression to achieve heterogeneity pursuit and feature selection simultaneously. As the name suggests, the problem is formulated under an effects-model parameterization, in which the cluster labels are missing and the effect of each predictor on the outcome is decomposed into a common effect term and a set of cluster-specific terms. A constrained sparse estimation of these effects leads to the identification of both the variables with common effects and those with heterogeneous effects. We propose an efficient algorithm and show that our approach can achieve both estimation and selection consistency. Simulation studies further demonstrate the effectiveness of our method under various practical scenarios. Three applications are presented, namely, an imaging genetics study linking genetic factors and brain neuroimaging traits in Alzheimer's disease, a public health study exploring the association between suicide risk among adolescents and their school district characteristics, and a sports analytics study examining how the salary levels of baseball players are associated with their performance and contractual status.
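To make the effects-model decomposition concrete, the following is a minimal sketch and not the estimation procedure proposed in the paper. It assumes that the cluster-wise regression coefficients are already known, that the cluster-specific terms are identified by a sum-to-zero constraint across clusters (a common convention for effects models, assumed here), and it ignores the scale adjustment by cluster-specific error variances mentioned above. The function names `decompose_effects` and `classify_predictors` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def decompose_effects(cluster_coefs):
    """Split a (p, K) matrix of cluster-wise coefficients into a common
    effect per predictor (the row mean) and cluster-specific deviations
    that sum to zero across the K clusters (assumed constraint)."""
    cluster_coefs = np.asarray(cluster_coefs, dtype=float)
    common = cluster_coefs.mean(axis=1)
    specific = cluster_coefs - common[:, None]
    return common, specific

def classify_predictors(cluster_coefs, tol=1e-8):
    """Label each predictor as 'irrelevant', 'common', or 'heterogeneous'
    according to which terms of the decomposition are nonzero."""
    common, specific = decompose_effects(cluster_coefs)
    labels = []
    for b0, dev in zip(common, specific):
        if np.all(np.abs(dev) <= tol):
            # No cluster-specific deviation: at most a common effect.
            labels.append("irrelevant" if abs(b0) <= tol else "common")
        else:
            # Some cluster deviates: a source of heterogeneity.
            labels.append("heterogeneous")
    return labels

# Illustrative coefficients for p = 3 predictors and K = 3 clusters.
coefs = np.array([[1.0, 1.0, 1.0],    # same effect in every cluster
                  [2.0, -1.0, 0.5],   # effect differs across clusters
                  [0.0, 0.0, 0.0]])   # no effect at all
common, specific = decompose_effects(coefs)
labels = classify_predictors(coefs)
```

In this toy setting only the second predictor would count as a source of heterogeneity. In the approach described above the cluster labels are missing, so the effects are not observed directly and are instead estimated sparsely under a constraint.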
