Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes

Abstract

Motivation: Medulloblastoma (MB) is a brain cancer predominantly arising in children. Roughly 70% of patients are cured today, but survivors often suffer from severe sequelae. MB has been extensively studied by molecular profiling, but often in small and scattered cohorts. To improve cure rates and reduce treatment side effects, accurate integration of such data to increase analytical power will be important, if not essential.

Results: We have integrated 23 transcription datasets, spanning 1350 MB and 291 normal brain samples. To remove batch effects, we combined the Removal of Unwanted Variation (RUV) method with a novel pipeline for determining empirical negative control genes and a panel of metrics to evaluate normalization performance. The documented approach enabled the removal of a majority of batch effects, producing a large-scale, integrative dataset of MB and cerebellar expression data. The proposed strategy will be broadly applicable for accurate integration of data and incorporation of normal reference samples for studies of various diseases. We hope that the integrated dataset will improve current research in the field of MB by allowing more large-scale gene expression analyses.

Availability and implementation: The RUV-normalized expression data are available through the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) and can be accessed via the GSE series number GSE124814.

Supplementary information: Supplementary data are available at Bioinformatics online.
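To make the role of negative control genes concrete, the following is a minimal sketch of an RUV-style correction in Python with NumPy. It is not the authors' pipeline: the abstract does not state which RUV variant was used or how the empirical control genes were selected, so the function name `ruv_negative_controls`, its parameters, and the choice of a plain regression on factors estimated from the control genes are all assumptions made for illustration. The idea shown is only the general one: genes assumed to carry no biological signal are used to estimate unwanted factors, which are then regressed out of every gene.

```python
import numpy as np


def ruv_negative_controls(Y, control_idx, k):
    """Illustrative RUV-style correction (hypothetical helper, not the paper's code).

    Y           : samples x genes matrix of log-expression values
    control_idx : column indices of genes assumed to be negative controls
    k           : number of unwanted factors to estimate and remove
    """
    Y = np.asarray(Y, dtype=float)

    # Center each gene so the factors capture variation, not mean expression.
    gene_means = Y.mean(axis=0)
    Yc = Y - gene_means

    # Negative controls are assumed to vary only because of unwanted effects,
    # so their leading singular vectors estimate the unwanted factors W.
    U, s, _ = np.linalg.svd(Yc[:, control_idx], full_matrices=False)
    W = U[:, :k] * s[:k]

    # Regress every gene on W and subtract the fitted unwanted component.
    alpha, *_ = np.linalg.lstsq(W, Yc, rcond=None)
    return Yc - W @ alpha + gene_means
```

In this sketch the quality of the correction depends entirely on the control set: if a supposed control gene does respond to the biology of interest, that signal is removed along with the batch effect, which is why the selection of control genes and the evaluation metrics matter.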
