Extracting replicable associations across multiple studies: algorithms for controlling the false discovery rate

In almost every field of genomics, large-scale biomedical datasets are used to report associations. Extracting associations that recur across multiple studies while controlling the false discovery rate is a fundamental challenge. Here, we consider an extension of Efron's single-study two-groups model to allow joint analysis of multiple studies. We assume that, given a set of p-values obtained from each study, the researcher is interested in associations that recur in at least k > 1 studies. We propose new algorithms that differ in how the study dependencies are modeled. We compared our new methods and others using various simulated scenarios. The top-performing algorithm, SCREEN (Scalable Cluster-based REplicability ENhancement), is our new algorithm, which is based on three stages: (1) clustering an estimated correlation network of the studies, (2) learning replicability (e.g., of genes) within clusters, and (3) merging the results across the clusters using dynamic programming. We applied SCREEN to two real datasets and demonstrated that it greatly outperforms standard meta-analysis. First, on a collection of 29 large-scale case-control gene expression cancer studies, we detected a large up-regulated module of genes related to proliferation and cell cycle regulation. These genes are both consistently up-regulated across many cancer studies and well connected in known gene networks. Second, on a recent pan-cancer study that examined the expression profiles of patients with or without mutations in the HLA complex, we detected an active module of up-regulated genes that are related to immune responses. Thanks to our ability to quantify the false discovery rate, we detected three times as many genes as the original study. Our module contains most of the genes reported in the original study, and many new ones. Interestingly, the newly discovered genes are needed to establish the connectivity of the module.
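
The replication criterion above (an association must recur in at least k studies) and the idea of a dynamic-programming merge can be illustrated with a small sketch. This is not the SCREEN implementation. It assumes, purely for illustration, that the studies are independent given each gene's per-study posterior probability of being null (a local fdr in two-groups terms). The function name `prob_at_least_k` and the example numbers are hypothetical. Under this independence assumption, the number of studies in which a gene is non-null follows a Poisson-binomial distribution, which a simple dynamic program computes exactly.

```python
from typing import Sequence


def prob_at_least_k(local_fdr: Sequence[float], k: int) -> float:
    """Probability that a gene is non-null in at least k studies.

    Assumption (illustrative, not from the paper): studies are independent,
    and local_fdr[i] is the posterior probability that the gene is null
    in study i.
    """
    n = len(local_fdr)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of studies")

    # dist[j] = probability that exactly j of the studies processed so far
    # are non-null for this gene.
    dist = [1.0] + [0.0] * n
    for seen, fdr in enumerate(local_fdr, start=1):
        if not 0.0 <= fdr <= 1.0:
            raise ValueError("each local fdr must lie in [0, 1]")
        p_nonnull = 1.0 - fdr
        # Update from high j to low j so dist[j - 1] still holds the value
        # from before this study was added.
        for j in range(seen, 0, -1):
            dist[j] = dist[j] * fdr + dist[j - 1] * p_nonnull
        dist[0] *= fdr

    return sum(dist[k:])


if __name__ == "__main__":
    # Hypothetical per-study local fdr values for one gene in three studies.
    example = [0.1, 0.2, 0.9]
    replicated = prob_at_least_k(example, k=2)
    print(f"P(non-null in >= 2 studies) = {replicated:.3f}")
    print(f"local fdr for the 'fewer than 2 studies' null = {1 - replicated:.3f}")
```

For the hypothetical values above, the probability of being non-null in at least two studies is approximately 0.746. One minus this probability plays the role of a local false discovery rate for the composite null "non-null in fewer than k studies". The dynamic program runs in O(n²) time per gene for n studies, which avoids enumerating all 2ⁿ null/non-null configurations. Modeling dependence between studies, as the algorithms in this paper do, requires more than this independent-studies sketch.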

arXiv:1609.01118v2 [stat.ME] 7 Sep 2016

Lay Summary

When analyzing results from multiple studies, extracting replicated associations is the first step towards making new discoveries. The standard approach for this task is to use meta-analysis methods, which usually test the null hypothesis that a gene has no effect in any of the studies. In replicability analysis, on the other hand, we explicitly require that the gene manifests a recurring pattern of effects. In this study we develop new algorithms for replicability analysis that are both scalable (i.e., can handle many studies) and allow controlling the false discovery rate. We show that our main algorithm, called SCREEN (Scalable Cluster-based REplicability ENhancement), outperforms the other methods in simulated scenarios. Moreover, when applied to real datasets, SCREEN greatly extended the results of the meta-analysis, and it can even facilitate detection of new biological results.

Introduction

Confidence in reported findings is a prerequisite for advancing any scientific field. Such confidence is achieved by showing replication of discoveries by further evidence from new studies [1]. In recent years, a new type of methodology called replicability analysis, sometimes referred to as reproducibility analysis, was suggested as a way to statistically quantify the replication of discoveries across studies while controlling the false discovery rate (FDR) [2]. This type of analysis is crucial in studies that aim to detect new hypotheses by integrating existing data from multiple high-throughput experiments.

The practical importance of replicability analysis is twofold. First, it is a tool for quantifying replication, and therefore the reliability, of reported results. This is of vital importance, as in recent years concerns have been raised in several domains regarding low reproducibility, including economics [3], psychology [4], medicine [5], and biological studies that rely on high-throughput experiments such as gene expression profiling [6, 7] and network biology [8]. Second, collating information from multiple studies can lead to scientific results that may be beyond the reach of a single study. Indeed, replicability analysis has been demonstrated as a tool for extracting new results by merging Genome Wide Association Studies (GWAS) [9].

The underlying assumption in standard meta-analysis is that the multiple studies estimate the same effect. Aggregating information across studies produces estimators with smaller measurement error that yield considerably more power to reject the null hypothesis regarding this effect. While meta-analyses are widely applied and have been extensively studied in the statistical literature [10] and in computational biology [11, 12], in recent years the changes in the scale and also the scope of public high-throughput biomedical data have led to new methodological challenges. For example, Zeggini et al. [13] analyzed results of genome-wide association scans for Type 2 Diabetes (T2D) on the same set of almost 2.5 million SNPs from eight study populations. In such situations, the first, and more obvious, challenge is accounting for inflation in the number of false discoveries due to the multiplicity of outcomes. The second challenge is hidden in the null hypothesis, tested in meta-analysis, that the effect size is 0 in all the studies. This null hypothesis is oblivious to the consistency of the effects, and thus it overlooks important scientific information. Third, there is a need to distinguish between true effects that are specific to a single study and true effects that represent general discoveries that are replicable. For example, Kraft et al. [14] suggested that for common genetic

[1]  Sara Ballouz,et al.  Positive and negative forms of replicability in gene network analysis , 2016, Bioinform..

[2]  Gideon Nave,et al.  Evaluating replicability of laboratory experiments in economics , 2016, Science.

[3]  Roman Rouzier,et al.  Low Concordance between Gene Expression Signatures in ER Positive HER2 Negative Breast Carcinoma Could Impair Their Clinical Application , 2016, PloS one.

[4]  K. Cibulskis,et al.  Comprehensive analysis of cancer-associated somatic mutations in class I HLA genes , 2015, Nature Biotechnology.

[5]  Ron Shamir,et al.  Integrated analysis of numerous heterogeneous gene expression profiles for detecting robust disease-specific biomarkers and proposing drug targets , 2015, Nucleic acids research.

[6]  Gary D. Bader,et al.  Novel function discovery with GeneMANIA: a new integrated resource for gene function prediction in Escherichia coli , 2015, Bioinform..

[7]  Ruth Heller,et al.  Repfdr: a Tool for Replicability Analysis for Genome-wide Association Studies , 2014, Bioinform..

[8]  George C Tseng,et al.  HYPOTHESIS SETTING AND ORDER STATISTIC FOR ROBUST GENOMIC META-ANALYSIS. , 2014, The annals of applied statistics.

[9]  Debashis Ghosh,et al.  Meta-analysis based on weighted ordered P-values for genomic data with heterogeneity , 2014, BMC Bioinformatics.

[10]  F. Thoemmes,et al.  Continuously Cumulating Meta-Analysis and Replicability , 2014, Perspectives on psychological science : a journal of the Association for Psychological Science.

[11]  Burkhard Morgenstern,et al.  Meta-Analysis of Pathway Enrichment: Combining Independent and Dependent Omics Data Sets , 2014, PloS one.

[12]  Yoav Benjamini,et al.  Deciding whether follow-up studies have replicated findings in a preliminary large-scale omics study , 2013, Proceedings of the National Academy of Sciences.

[13]  Ruth Heller,et al.  Replicability analysis for genome-wide association studies , 2012, arXiv:1209.2829.

[14]  George C. Tseng,et al.  Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline , 2013, BMC Bioinformatics.

[15]  Jochen Gaedcke,et al.  Integration of Metabolomics and Transcriptomics Revealed a Fatty Acid Network Exerting Growth Inhibitory Effects in Human Pancreatic Cancer , 2013, Clinical Cancer Research.

[16]  Il-Jin Kim,et al.  Rewiring of human lung cell lineage and mitotic networks in lung adenocarcinomas , 2013, Nature Communications.

[17]  Sean R. Davis,et al.  NCBI GEO: archive for functional genomics data sets—update , 2012, Nucleic Acids Res..

[18]  Christian P. Robert,et al.  Large-scale inference , 2010 .

[19]  Isaac Dialsingh,et al.  Large-scale inference: empirical Bayes methods for estimation, testing, and prediction , 2012 .

[20]  Itzhak Avital,et al.  Genomic and genetic characterization of cholangiocarcinoma identifies therapeutic targets for tyrosine kinase inhibitors. , 2012, Gastroenterology.

[21]  S. Royle,et al.  The role of clathrin in mitotic spindle organisation , 2012, Journal of Cell Science.

[22]  E. Huang,et al.  Integrating Factor Analysis and a Transgenic Mouse Model to Reveal a Peripheral Blood Predictor of Breast Tumors , 2011, BMC Medical Genomics.

[23]  Nan Hu,et al.  A Gene Expression Signature from Peripheral Whole Blood for Stage I Lung Adenocarcinoma , 2011, Cancer Prevention Research.

[24]  M. Delorenzi,et al.  Identification of Prognostic Molecular Features in the Reactive Stroma of Human Breast and Prostate Cancer , 2011, PloS one.

[25]  Albert Gutierrez,et al.  LEF-1 is a prosurvival factor in chronic lymphocytic leukemia and is expressed in the preleukemic state of monoclonal B-cell lymphocytosis. , 2010, Blood.

[26]  Nan Hu,et al.  Genome wide analysis of DNA copy number neutral loss of heterozygosity (CNNLOH) and its relation to gene expression in esophageal squamous cell carcinoma , 2010, BMC Genomics.

[27]  Gary D. Bader,et al.  GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop , 2010, Bioinform..

[28]  B. Efron Correlated z-Values and the Accuracy of Large-Scale Statistical Estimates , 2010, Journal of the American Statistical Association.

[29]  Chuhsing Kate Hsiao,et al.  Identification of a Novel Biomarker, SEMA5A, for Non–Small Cell Lung Carcinoma in Nonsmoking Women , 2010, Cancer Epidemiology, Biomarkers & Prevention.

[30]  Richard D Kolodner,et al.  An overview of Cdk1-controlled targets and processes , 2010, Cell Division.

[31]  P. Sebastiani,et al.  Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile , 2010, British Journal of Cancer.

[32]  David Elashoff,et al.  Salivary transcriptomic biomarkers for detection of resectable pancreatic cancer. , 2010, Gastroenterology.

[33]  R. Sharan,et al.  Expander: from expression microarrays to networks and functions , 2010, Nature Protocols.

[34]  Yoav Benjamini,et al.  Selective inference in complex research , 2009, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[35]  Peter Kraft,et al.  Replication in genome-wide association studies. , 2009, Statistical science : a review journal of the Institute of Mathematical Statistics.

[36]  B. Efron Empirical Bayes Estimates for Large-Scale Prediction Problems , 2009, Journal of the American Statistical Association.

[37]  Y. Benjamini,et al.  Screening for Partial Conjunction Hypotheses , 2008, Biometrics.

[38]  Martin-Leo Hansmann,et al.  Origin and pathogenesis of nodular lymphocyte–predominant Hodgkin lymphoma as revealed by global gene expression analysis , 2008, The Journal of experimental medicine.

[39]  M. Mansukhani,et al.  Identification of copy number gain and overexpressed genes on chromosome arm 20q by an integrative genomic approach in cervical cancer: Potential role in progression , 2008, Genes, chromosomes & cancer.

[40]  Korbinian Strimmer,et al.  A unified approach to false discovery rate estimation , 2008, BMC Bioinformatics.

[41]  Xiao-Hua Zhou,et al.  Statistical Methods for Meta‐Analysis , 2008 .

[42]  S. Wacholder,et al.  Gene Expression Signature of Cigarette Smoking and Its Role in Lung Adenocarcinoma Development and Survival , 2008, PloS one.

[43]  Soheil Meshinchi,et al.  Identification of genes with abnormal expression changes in acute myeloid leukemia , 2008, Genes, chromosomes & cancer.

[44]  C. Sotiriou,et al.  Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures , 2007, Breast Cancer Research.

[45]  Victoria Kristina Perry,et al.  Gene expression abnormalities in histologically normal breast epithelium of breast cancer patients , 2007, International journal of cancer.

[46]  M. Newton Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis , 2008 .

[47]  Michal A. Kurowski,et al.  Transcriptome Profile of Human Colorectal Adenomas , 2007, Molecular Cancer Research.

[48]  Bin Nan,et al.  Gene expression analysis of preinvasive and invasive cervical squamous cell carcinomas identifies HOXC10 as a key mediator of invasion. , 2007, Cancer research.

[49]  Mala Sinha,et al.  Secreted Frizzled-Related Protein 1 Loss Contributes to Tumor Phenotype of Clear Cell Renal Cell Carcinoma , 2007, Clinical Cancer Research.

[50]  M. McCarthy,et al.  Replication of Genome-Wide Association Signals in UK Samples Reveals Risk Loci for Type 2 Diabetes , 2007, Science.

[51]  J. Miguel,et al.  Gene expression profiling of B lymphocytes and plasma cells from Waldenström's macroglobulinemia: comparison with expression patterns of the same cell counterparts from chronic lymphocytic leukemia, multiple myeloma and normal individuals , 2007, Leukemia.

[52]  P. Sebastiani,et al.  Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer , 2007, Nature Medicine.

[53]  K. Ho,et al.  A Susceptibility Gene Set for Early Onset Colorectal Cancer That Integrates Diverse Signaling Pathways: Implication for Tumorigenesis , 2007, Clinical Cancer Research.

[54]  Martin Rosvall,et al.  An information-theoretic framework for resolving community structure in complex networks , 2007, Proceedings of the National Academy of Sciences.

[55]  G. Turashvili,et al.  Novel markers for differentiation of lobular and ductal invasive breast carcinomas by laser microdissection and microarray analysis , 2007, BMC Cancer.

[56]  M. Newton,et al.  Genome-wide expression profiling reveals EBV-associated inhibition of MHC class I expression in nasopharyngeal carcinoma. , 2006, Cancer research.

[57]  Geoffrey J. McLachlan,et al.  A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays , 2006, Bioinform..

[58]  Jayant P. Menon,et al.  Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. , 2006, Cancer cell.

[59]  B. Efron Correlation and Large-Scale Simultaneous Significance Testing , 2007 .

[60]  J. Ioannidis Why Most Published Research Findings Are False , 2005, PLoS medicine.

[61]  N. Segal,et al.  Analysis of hypoxia-related gene expression in sarcomas and effect of hypoxia on RNA interference of vascular endothelial cell growth factor A. , 2005, Cancer research.

[62]  David J Sugarbaker,et al.  Identification of Novel Candidate Oncogenes and Tumor Suppressors in Malignant Pleural Mesothelioma Using Large-Scale Transcriptional Profiling , 2005 .

[63]  P. Shannon,et al.  Cytoscape: a software environment for integrated models of biomolecular interaction networks. , 2003, Genome research.

[64]  C. Larroque,et al.  Characterization of the cDNA and pattern of expression of a new gene over-expressed in human hepatomas and colonic tumors. , 1995, European journal of biochemistry.

[65]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .