Batch effect removal methods for microarray gene expression data integration: a survey

Genomic data integration is a key step towards large-scale genomic data analysis. The process is challenging because genomics experiments produce information from diverse sources. In this work, we review methods designed to combine genomic data recorded from microarray gene expression (MAGE) experiments. It has been acknowledged that the main source of variation between different MAGE datasets is the so-called 'batch effects'. The methods reviewed here perform data integration by removing (or, more precisely, attempting to remove) the unwanted variation associated with batch effects. They are presented in a unified framework together with a wide range of evaluation tools, which are essential for assessing the efficiency and quality of the data integration process. We provide a systematic description of the MAGE data integration methodology, together with some basic recommendations to help users choose the appropriate tools for integrating MAGE data for large-scale analysis and evaluate those tools from different perspectives in order to quantify their efficiency. All genomic data used in this study for illustration purposes were retrieved from InSilicoDB (http://insilico.ulb.ac.be).
