A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits

MOTIVATION: Epidemiologic, clinical, and translational studies are increasingly generating multiplatform omics data. Methods that can integrate multiple high-dimensional data types while accounting for differential patterns are critical for uncovering novel associations and relevant underlying subgroups.

RESULTS: We propose an integrative model to estimate latent unknown clusters (LUCID) that distinguishes unique genomic, exposure, and informative biomarker/omic effects while jointly estimating subgroups relevant to the outcome of interest. Simulation studies indicate that the model yields consistent estimates that reflect the true simulated values, accurately estimates subgroups, and recapitulates subgroup-specific effects. We also demonstrate the use of the integrated model for future prediction of risk subgroups and phenotypes. We apply this approach in two real-data applications to highlight the integration of genomic, exposure, and metabolomic data.

AVAILABILITY AND IMPLEMENTATION: The LUCID method is implemented in the LUCIDus R package, available on CRAN (https://CRAN.R-project.org/package=LUCIDus).

SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.
