A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis

MOTIVATION: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data.

RESULTS: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply the proposed test statistic to simulated data and to two copy number variation case studies: the first consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in both case studies.

CONCLUSION: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.

AVAILABILITY AND IMPLEMENTATION: The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article.

CONTACT: reesese@vcu.edu
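The abstract does not spell out the test statistic, so the following is only a minimal sketch of one common reading of gPCA, under stated assumptions: the guided loading is taken as the leading right singular vector of Y'X, where X is the feature-centered data matrix and Y is a batch indicator matrix; the statistic is the ratio of the variance of the first guided component to that of the first unguided component; and significance is assessed by permuting batch labels. The function names (`gpca_delta`, `gpca_permutation_test`) and the simulated data are hypothetical and are not the API of the gPCA R package.

```python
import numpy as np


def first_right_singular_vector(a):
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(a, full_matrices=False)
    return vt[0]


def gpca_delta(x, batch):
    """Assumed statistic: Var(X v_guided) / Var(X v_unguided), x is samples x features."""
    x = np.asarray(x, dtype=float)
    batch = np.asarray(batch)
    x = x - x.mean(axis=0)                        # center each feature
    labels = np.unique(batch)
    y = (batch[:, None] == labels[None, :]).astype(float)  # n x b batch indicator
    v_guided = first_right_singular_vector(y.T @ x)         # guided by batch
    v_unguided = first_right_singular_vector(x)             # ordinary PCA loading
    return np.var(x @ v_guided, ddof=1) / np.var(x @ v_unguided, ddof=1)


def gpca_permutation_test(x, batch, n_perm=200, seed=0):
    """One-sided permutation p-value for the assumed statistic."""
    rng = np.random.default_rng(seed)
    batch = np.asarray(batch)
    observed = gpca_delta(x, batch)
    permuted = np.array(
        [gpca_delta(x, rng.permutation(batch)) for _ in range(n_perm)]
    )
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(permuted >= observed)) / (n_perm + 1)
    return observed, p_value


# Simulated example (assumption): 40 samples, 50 features, two batches,
# with the second batch shifted on the first 20 features.
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 20)
x = rng.normal(size=(40, 50))
x[batch == 1, :20] += 3.0
delta, p_value = gpca_permutation_test(x, batch)
```

Because the unguided first component maximizes variance by construction, the ratio lies between 0 and 1; values near 1 suggest that batch is aligned with the dominant source of variation, while the permutation p-value indicates whether the observed ratio is larger than expected when batch labels are unrelated to the data.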