Robustifying genomic classifiers to batch effects via ensemble learning

Genomic data are often produced in batches due to practical restrictions, which can introduce unwanted variation caused by discrepancies across processing batches. Such “batch effects” often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software that merges the data from different batches, then estimates the batch effects and removes them from the data. Here we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods. In contrast to the typical approach of removing batch effects from the merged data, our method integrates predictions rather than data. We provide a systematic comparison between these two strategies, using studies targeting diverse populations infected with tuberculosis. In one study, we simulate increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status, and evaluate the accuracy of prediction in an independent study targeting a different population cohort. We observe a turning point in the level of heterogeneity, beyond which our strategy of integrating predictions yields better discrimination in independent validation than the traditional method of integrating the data. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.
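The idea of integrating predictions rather than data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the procedure used in this work: the two-feature Gaussian simulation with an additive batch shift, the nearest-centroid learner, and the weighting rule (each batch-specific model is weighted by its mean accuracy on the batches it was not trained on). The function names `simulate_batch`, `fit_centroids`, `cross_batch_weights`, and `ensemble_proba` are hypothetical.

```python
import math
import random


def simulate_batch(n, shift, rng):
    """Draw n labelled samples of two features with an additive batch shift (toy data)."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so the classes are balanced
        mean = 2.0 if label == 1 else -2.0
        X.append([rng.gauss(mean + shift, 1.0) for _ in range(2)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid learner (an illustrative choice): store per-class feature means."""
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_proba(model, x):
    """Turn the difference in squared distances to the two centroids into a score in [0, 1]."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, model[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, model[1]))
    return 1.0 / (1.0 + math.exp(-(d0 - d1) / 2.0))


def accuracy(model, X, y):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((predict_proba(model, x) > 0.5) == (t == 1) for x, t in zip(X, y))
    return hits / len(y)


def cross_batch_weights(models, batches):
    """Assumed weighting rule: mean accuracy on the other batches, normalised to sum to 1."""
    raw = []
    for k, model in enumerate(models):
        others = [accuracy(model, X, y) for j, (X, y) in enumerate(batches) if j != k]
        raw.append(sum(others) / len(others))
    total = sum(raw)
    return [r / total for r in raw]


def ensemble_proba(models, weights, x):
    """Integrate predictions: weighted average of the batch-specific scores."""
    return sum(w * predict_proba(m, x) for m, w in zip(models, weights))


rng = random.Random(0)
# Three simulated batches that differ only by an additive shift.
batches = [simulate_batch(60, shift, rng) for shift in (-1.0, 0.0, 1.0)]
models = [fit_centroids(X, y) for X, y in batches]  # one model per batch
weights = cross_batch_weights(models, batches)

# Independent validation set with its own shift, standing in for a separate study.
X_val, y_val = simulate_batch(200, 0.5, rng)
val_acc = sum(
    (ensemble_proba(models, weights, x) > 0.5) == (t == 1) for x, t in zip(X_val, y_val)
) / len(y_val)
print("weights:", weights)
print("validation accuracy:", val_acc)
```

No batch-corrected merged data set is ever formed in this sketch: each model sees only its own batch, and the batches interact only through the weights. Other weighting schemes could be substituted in `cross_batch_weights` without changing the rest of the code.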
