Decomposition of variation of mixed variables by a latent mixed Gaussian copula model

Many biomedical studies collect variables of mixed types from multiple groups of subjects. Some of these studies aim to separate the group‐specific variation from the variation common to all groups. Although similar problems have been studied before, existing methods rely mainly on the Pearson correlation, which is not suited to mixed data. To address this issue, we propose a latent mixed Gaussian copula (LMGC) model that quantifies the correlations among binary, ordinal, continuous, and truncated variables in a unified framework. We also provide a tool that decomposes the variation into group‐specific and common components across multiple groups by solving a regularized M‐estimation problem. Extensive simulation studies show the advantage of the proposed method over Pearson correlation‐based methods, and show that solving the M‐estimation problem jointly over multiple groups outperforms decomposing the variation group by group. Finally, we apply the method to a Chlamydia trachomatis genital tract infection study to demonstrate how it can be used to discover informative biomarkers that differentiate patients.
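To make the contrast with the Pearson correlation concrete, the following is a minimal sketch of rank-based latent correlation estimation under a Gaussian copula, restricted to a pair of continuous variables. It uses the standard identity r = sin(π τ / 2) linking Kendall's tau to the latent Gaussian correlation. It is not the LMGC estimator itself: the bridge functions for binary, ordinal, and truncated variables and the regularized decomposition are not shown, and all function names here (`kendall_tau`, `latent_correlation`, `simulate`) are hypothetical.

```python
import math
import random


def kendall_tau(x, y):
    """Kendall's tau-a by direct pair counting (O(n^2), fine for small n)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return 2.0 * s / (n * (n - 1))


def latent_correlation(x, y):
    """Map Kendall's tau to the latent Gaussian correlation (continuous pair)."""
    return math.sin(math.pi / 2.0 * kendall_tau(x, y))


def pearson(x, y):
    """Plain sample Pearson correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=800, rho=0.6, seed=0):
    """Draw latent Gaussian pairs, then apply monotone transformations.

    The transformations (exp and cube) are illustrative assumptions only.
    """
    rng = random.Random(seed)
    z1s, z2s, xs, ys = [], [], [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        z1s.append(z1)
        z2s.append(z2)
        xs.append(math.exp(z1))  # observed variable 1
        ys.append(z2 ** 3)       # observed variable 2
    return z1s, z2s, xs, ys


z1s, z2s, xs, ys = simulate()
r_latent = latent_correlation(xs, ys)   # estimate of the latent correlation
r_pearson = pearson(xs, ys)             # Pearson on the observed scale
print(f"latent estimate: {r_latent:.3f}, Pearson: {r_pearson:.3f}")
```

Because Kendall's tau depends only on ranks, the latent estimate is unchanged by the monotone transformations, whereas the Pearson correlation computed on the observed scale generally differs from the latent value of 0.6 used in the simulation.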
