Clustering high‐dimensional mixed data to uncover sub‐phenotypes: joint analysis of phenotypic and genotypic data

The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups, or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high-dimensional mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated; estimation is via a Gibbs sampling algorithm; and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes both phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Further, 7 years after the LIPGENE-SU.VI.MAX data were collected, participants underwent further analysis to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes correspond strongly to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to quantify the uncertainty in sub-phenotype membership at the participant level aligns with the concepts of precision medicine and nutrition. Copyright © 2017 John Wiley & Sons, Ltd.
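As a rough illustration of participant-level membership uncertainty in a mixture model for mixed data, the sketch below computes posterior cluster membership probabilities for one participant from one continuous phenotype and one categorical genotype. It is not the latent variable model of the paper: it substitutes a much simpler two-cluster mixture that assumes the two variables are independent within each cluster, and every parameter value (`weights`, `means`, `sds`, `geno_probs`) is hypothetical and chosen only for the example.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probabilities(x_cont, genotype, weights, means, sds, geno_probs):
    """Posterior cluster membership probabilities for one participant.

    Assumes (for illustration only) that, within a cluster, the continuous
    phenotype is normal and independent of the categorical genotype.
    """
    joint = [
        w * normal_pdf(x_cont, m, s) * gp[genotype]
        for w, m, s, gp in zip(weights, means, sds, geno_probs)
    ]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical parameters for two clusters ('healthy', 'at risk').
weights = [0.6, 0.4]                  # mixing proportions
means = [0.0, 2.0]                    # cluster means of the phenotype
sds = [1.0, 1.0]                      # cluster standard deviations
geno_probs = [[0.70, 0.25, 0.05],     # genotype probabilities, cluster 0
              [0.30, 0.40, 0.30]]     # genotype probabilities, cluster 1

# Same phenotype value, different genotypes: the phenotype alone cannot
# separate the clusters at x = 1.0, so the genotype drives the assignment.
probs_a = membership_probabilities(1.0, 0, weights, means, sds, geno_probs)
probs_b = membership_probabilities(1.0, 2, weights, means, sds, geno_probs)
print(probs_a, probs_b)
```

With these made-up values, two participants with an identical phenotype measurement receive different membership probabilities purely because of their genotype, which is the sense in which both data types can be discriminatory.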
