MIA: Multi-cohort Integrated Analysis for Biomarker Identification

Advanced high-throughput technologies have produced vast amounts of biological data. Data integration is key to obtaining the statistical power needed to pinpoint the biological mechanisms and biomarkers of the underlying disease. Two critical drawbacks of existing computational approaches for data integration are that they account neither for study bias nor for the noisy nature of molecular data. This leads to unreliable and inconsistent results: the results change drastically when the input is slightly perturbed or when additional datasets are added to the analysis. Here we propose a multi-cohort integrated approach, named MIA, for biomarker identification that is robust to noise and study bias. We deploy a leave-one-out strategy to prevent any single cohort from having a disproportionate influence on the results. We also combine techniques from both p-value-based and effect-size-based meta-analyses to ensure that the identified genes are significantly impacted. We compare MIA with five classical approaches (Fisher's, Stouffer's, maxP, minP, and the additive method) using 7 microarray and 4 RNA-Seq datasets. For each approach, we construct a disease signature using 3 datasets and then classify patients from the 8 remaining datasets. MIA outperforms all five classical approaches, achieving the highest sensitivity and specificity in distinguishing symptomatic patients from healthy controls.
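To make the classical baselines concrete, the following is a minimal Python sketch of the five p-value combination rules named above, for one gene whose p-values come from k independent cohorts. The function names are illustrative and not taken from the MIA software. The `leave_one_out_max` wrapper is only an assumption about how a leave-one-out idea could be layered on top of a combination rule (recombine with each cohort removed and keep the most conservative result); it is not the procedure MIA itself uses, which the abstract does not specify.

```python
"""Classical p-value combination rules, plus a hypothetical leave-one-out wrapper."""
import math
from statistics import NormalDist


def fisher(pvals):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k d.o.f. under the null."""
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the chi-square statistic
    # Closed-form chi-square survival function for an even number (2k) of d.o.f.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def stouffer(pvals):
    """Stouffer's method: sum of z-scores scaled by sqrt(k). Requires 0 < p < 1."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - nd.cdf(z)


def min_p(pvals):
    """minP (Tippett): the smallest of k uniform p-values follows Beta(1, k)."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)


def max_p(pvals):
    """maxP (Wilkinson): the largest of k uniform p-values follows Beta(k, 1)."""
    return max(pvals) ** len(pvals)


def additive(pvals):
    """Additive method: the sum of k uniform p-values is Irwin-Hall distributed."""
    k = len(pvals)
    s = sum(pvals)
    # Irwin-Hall cumulative distribution function evaluated at s.
    total = sum(
        (-1) ** j * math.comb(k, j) * (s - j) ** k
        for j in range(math.floor(s) + 1)
    )
    return total / math.factorial(k)


def leave_one_out_max(pvals, combine):
    """Hypothetical wrapper (an assumption, not MIA's documented procedure):
    drop each cohort in turn, recombine, and return the largest combined p-value,
    so that no single cohort can drive the result on its own."""
    pvals = list(pvals)
    if len(pvals) < 2:
        raise ValueError("need at least two cohorts")
    return max(combine(pvals[:i] + pvals[i + 1:]) for i in range(len(pvals)))


if __name__ == "__main__":
    cohort_pvals = [0.01, 0.02, 0.9]  # made-up p-values for one gene in 3 cohorts
    for rule in (fisher, stouffer, min_p, max_p, additive):
        print(rule.__name__, rule(cohort_pvals), leave_one_out_max(cohort_pvals, rule))
```

With the made-up input above, the leave-one-out value under Fisher's method is larger than the plain combined p-value, because removing either of the two small p-values weakens the evidence. This illustrates how a single cohort can dominate a classical combination.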
