Multi‐study factor analysis

We introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies. The goal is to separately identify and estimate (1) common factors shared across multiple studies, and (2) study-specific factors. We develop an Expectation Conditional-Maximization (ECM) algorithm for parameter estimation and provide a procedure for choosing the numbers of common and study-specific factors. We evaluate the performance of the method in simulations and illustrate it with an application to gene expression data in ovarian cancer. In both settings, we clarify the benefits of a joint analysis over standard factor analysis applied to each study separately. The resulting tool is intended to accelerate the pace at which unsupervised analyses can be combined across multiple studies, and to improve understanding of the cross-study reproducibility of signal in multivariate data. An R package, MSFA, implements the method and is available on GitHub.
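To make the distinction between common and study-specific factors concrete, the following is a minimal simulation sketch in Python with NumPy. It is not the MSFA package and does not implement the ECM algorithm. It assumes, as is usual for models of this kind but is not spelled out in the abstract, that an observation in study s is generated as x = Phi f + Lambda_s l + e, with a loading matrix Phi shared by all studies, a loading matrix Lambda_s unique to study s, and independent noise with diagonal variances psi_s. The function name `simulate_multi_study` and all parameter names are hypothetical.

```python
import numpy as np

def simulate_multi_study(n_per_study, p, k_common, k_specific, seed=0):
    """Draw data from an assumed multi-study factor model.

    Assumed form (an illustration, not taken from the abstract):
        x = Phi @ f + Lambda_s @ l + e
    Phi is shared by every study; Lambda_s and psi_s belong to study s.
    """
    rng = np.random.default_rng(seed)
    Phi = rng.normal(size=(p, k_common))        # common loadings, p x k_common
    studies = []
    for n, j in zip(n_per_study, k_specific):
        Lambda = rng.normal(size=(p, j))        # study-specific loadings, p x j
        psi = rng.uniform(0.5, 1.0, size=p)     # diagonal noise variances
        f = rng.normal(size=(n, k_common))      # common factor scores
        l = rng.normal(size=(n, j))             # study-specific factor scores
        e = rng.normal(size=(n, p)) * np.sqrt(psi)
        X = f @ Phi.T + l @ Lambda.T + e
        # Covariance implied by the assumed model for this study.
        Sigma = Phi @ Phi.T + Lambda @ Lambda.T + np.diag(psi)
        studies.append({"X": X, "Lambda": Lambda, "psi": psi, "Sigma": Sigma})
    return Phi, studies

# Two studies sharing 2 common factors, with 1 and 2 specific factors.
Phi, studies = simulate_multi_study([500, 800], p=6, k_common=2,
                                    k_specific=[1, 2], seed=42)
for s, study in enumerate(studies, start=1):
    print(f"study {s}: X {study['X'].shape}, Lambda {study['Lambda'].shape}")
```

Under this assumed model, each study's covariance splits into a part common to all studies (Phi Phi'), a part unique to the study (Lambda_s Lambda_s'), and diagonal noise. Separating the first two parts is the estimation problem the abstract describes; fitting standard factor analysis to each study on its own would not distinguish them.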
