Bayesian multistudy factor analysis for high-throughput biological data

This paper presents a new modeling strategy for the joint unsupervised analysis of multiple high-throughput biological studies. As in Multi-study Factor Analysis, our goals are to identify both common factors shared across studies and study-specific factors. Our approach is motivated by the growing body of high-throughput studies in biomedical research, as exemplified by the comprehensive set of expression data on breast tumors considered in our case study. To handle high-dimensional studies, we extend Multi-study Factor Analysis using a Bayesian approach that imposes sparsity. Specifically, we generalize the sparse Bayesian infinite factor model to multiple studies. We also devise a novel solution to the identification of the loading matrices: we recover the loading matrices of interest ex post by adapting the orthogonal Procrustes approach. Computationally, we propose an efficient Gibbs sampling approach. Through an extensive simulation analysis, we show that the proposed approach performs well across a range of scenarios and outperforms standard factor analysis in all of them, identifying replicable signal in unsupervised genomic applications. Our analysis of breast cancer gene expression across seven studies identified replicable gene patterns clearly related to well-known breast cancer pathways. An R package implementing the method is available on GitHub.
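
The ex-post recovery of loading matrices mentioned above builds on the orthogonal Procrustes problem. The abstract does not spell out the authors' adaptation, so what follows is only a minimal sketch of the classical version, written in Python with NumPy rather than with the R package. The function names, the choice of reference matrix `target`, and the simulated draws are all assumptions made for illustration: each simulated draw is a randomly rotated, slightly noisy copy of `target`, standing in for posterior samples whose loadings are identified only up to rotation.

```python
import numpy as np


def procrustes_rotation(loadings, target):
    """Orthogonal Q minimizing the Frobenius norm of loadings @ Q - target."""
    # Classical solution: SVD of loadings' target = U S V', then Q = U V'.
    u, _, vt = np.linalg.svd(loadings.T @ target)
    return u @ vt


def align_draws(draws, target):
    """Rotate every draw toward the same reference matrix (hypothetical helper)."""
    return [d @ procrustes_rotation(d, target) for d in draws]


# Simulated example: 20 variables, 3 factors, 5 draws (illustrative sizes).
rng = np.random.default_rng(0)
target = rng.normal(size=(20, 3))
draws = []
for _ in range(5):
    # Random orthogonal matrix from a QR decomposition mimics rotational ambiguity.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    draws.append(target @ q + 0.01 * rng.normal(size=(20, 3)))

aligned = align_draws(draws, target)
posterior_mean = np.mean(aligned, axis=0)  # averaging is meaningful only after alignment
```

Averaging the raw draws would blur loadings that differ only by rotation, which is why the rotation step precedes any posterior summary in this sketch.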
