Covariance adjustment for batch effect in gene expression data

Batch bias has been found in many microarray gene expression studies that involve multiple batches of samples. A serious batch effect can alter not only the distribution of individual genes but also the inter-gene relationships. Although some efforts have been made to remove such bias, multivariate approaches have seen relatively little development, mainly because of the analytical difficulty posed by the high dimensionality of gene expression data. We propose a multivariate batch adjustment method that effectively eliminates inter-gene batch effects. The proposed method uses high-dimensional sparse covariance estimation based on a factor model and hard thresholding. Another important feature of the proposed method is that, if one of the batches is known to have been produced under superior conditions, the other batches can be adjusted to resemble that target batch. We study the high-dimensional asymptotic properties of the proposed estimator and compare the proposed method with several popular existing methods on simulated data and gene expression data sets.
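
As a rough illustration of the idea, the sketch below estimates each batch's covariance as a low-rank factor part plus a hard-thresholded residual, then maps a batch onto a target batch by matching means and covariances. It is not the estimator studied in the paper. The principal-component factor fit, the linear map Sigma_target^{1/2} Sigma_batch^{-1/2}, the function names, and the parameters `n_factors`, `thresh`, and `floor` are all assumptions made for this example.

```python
import numpy as np

def factor_threshold_cov(X, n_factors=2, thresh=0.1):
    """Covariance sketch: low-rank factor part plus hard-thresholded residual.

    X is samples x genes. Factors are taken as the leading principal
    components of the sample covariance (an assumption of this sketch).
    """
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (X.shape[0] - 1)            # sample covariance
    vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_factors]    # indices of the leading factors
    loadings = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))
    low_rank = loadings @ loadings.T
    resid = S - low_rank
    # Hard thresholding: zero residual entries smaller than thresh in magnitude.
    resid_thr = np.where(np.abs(resid) >= thresh, resid, 0.0)
    np.fill_diagonal(resid_thr, np.diag(resid))  # never threshold variances
    return low_rank + resid_thr

def sym_power(A, power, floor=1e-6):
    """Power of a symmetric matrix, with eigenvalues floored for stability."""
    vals, vecs = np.linalg.eigh((A + A.T) / 2.0)
    vals = np.clip(vals, floor, None)
    return (vecs * vals**power) @ vecs.T

def adjust_to_target(X_batch, X_target, n_factors=2, thresh=0.1):
    """Shift and rescale X_batch so its mean and estimated covariance
    resemble those of X_target (hypothetical adjustment rule)."""
    mu_b = X_batch.mean(axis=0)
    mu_t = X_target.mean(axis=0)
    cov_b = factor_threshold_cov(X_batch, n_factors, thresh)
    cov_t = factor_threshold_cov(X_target, n_factors, thresh)
    A = sym_power(cov_t, 0.5) @ sym_power(cov_b, -0.5)
    return (X_batch - mu_b) @ A.T + mu_t
```

Calling `adjust_to_target(X_batch, X_target)` with two samples-by-genes arrays over the same genes returns an adjusted copy of `X_batch`. With `thresh=0` the covariance estimate reduces to the sample covariance, so when there are more samples than genes the adjusted batch reproduces the target's sample mean and covariance up to numerical error. Thresholding matters in the high-dimensional case, where the raw sample covariance is singular.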
