Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction

MOTIVATION: Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms that remove these artifacts enhance differences between known biological covariates, but they may also remove intragroup biological heterogeneity and, with it, any personalized genomic signatures. As a result, accurately identifying novel subtypes from batch-corrected genomics data is challenging with standard algorithms, which are designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori.

RESULTS: We therefore assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), which uses a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict Human Papillomavirus (HPV) status, pSVA improved cross-study validation even when the sample batches were highly confounded with HPV status in the training set.

AVAILABILITY AND IMPLEMENTATION: All analyses were performed using R version 2.15.0. The code and data used to generate the results of this manuscript are available from https://sourceforge.net/projects/psva.
