Sparse multivariate factor analysis regression model and its applications to integrative genomics analysis

The multivariate regression model is a useful tool for exploring complex associations between two kinds of molecular markers and thereby for understanding the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for the dependency among them can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations among the response variables are assumed to follow a factor analysis model with latent factors. The proposed method allows us not only to address the challenge that the number of association parameters is larger than the sample size, but also to adjust for unobserved genetic and/or nongenetic factors that potentially conceal the underlying response‐predictor associations. smFARM is implemented with the EM algorithm and a blockwise coordinate descent algorithm. The methodology is evaluated and compared with existing methods through extensive simulation studies. Our results show that accounting for latent factors through smFARM can improve the sensitivity of signal detection and the accuracy of sparse association map estimation. We illustrate smFARM with two integrative genomics examples, a breast cancer dataset and an ovarian cancer dataset, in which we assess the relationship between DNA copy numbers and gene expression arrays to understand genetic regulatory patterns relevant to the disease. We identify two trans‐hub regions: one in cytoband 17q12, whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32‐33, which is associated with chemoresistance in ovarian cancer.
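
As a rough illustration of the setting described above, the following Python sketch simulates data from one plausible reading of the model: responses equal a sparse linear function of the predictors, plus a few latent factors shared across responses, plus independent noise. The abstract does not give the model equations, so the form Y = XB + FΛᵀ + E, the function name `simulate_factor_regression`, and all dimensions and distributions are assumptions made for illustration only. It is not the authors' implementation and does not perform the EM or coordinate descent fitting.

```python
import numpy as np


def simulate_factor_regression(n=50, p=20, q=10, k=2, n_nonzero=8, seed=0):
    """Simulate from an ASSUMED factor-regression model Y = X B + F Lam^T + E.

    n: samples, p: predictors, q: responses, k: latent factors.
    All names and defaults are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)

    # Predictors (e.g. copy-number features in the genomics setting).
    X = rng.standard_normal((n, p))

    # Sparse association matrix: only n_nonzero of the p*q entries are active.
    B = np.zeros((p, q))
    flat_idx = rng.choice(p * q, size=n_nonzero, replace=False)
    rows, cols = np.unravel_index(flat_idx, (p, q))
    B[rows, cols] = rng.uniform(1.0, 2.0, size=n_nonzero)

    # Latent factors shared by all responses, with loadings Lam.
    Lam = rng.standard_normal((q, k))
    F = rng.standard_normal((n, k))

    # Response-specific noise with variances psi.
    psi = rng.uniform(0.5, 1.0, size=q)
    E = rng.standard_normal((n, q)) * np.sqrt(psi)

    Y = X @ B + F @ Lam.T + E

    # Covariance of the responses given X that this model implies:
    # a rank-k part from the factors plus a diagonal part from the noise.
    Sigma = Lam @ Lam.T + np.diag(psi)
    return X, Y, B, Sigma


X, Y, B, Sigma = simulate_factor_regression()
print("association parameters:", B.size, "sample size:", X.shape[0])
print("nonzero associations:", np.count_nonzero(B))
```

With these default values the association matrix has p × q = 200 entries but only n = 50 samples, which mirrors the "more parameters than samples" situation the abstract refers to; the low-rank-plus-diagonal `Sigma` is what makes the responses correlated even after the predictors are accounted for.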
