Stratified randomization controls better for batch effects in 450K methylation analysis: a cautionary tale

Background: Batch effects in DNA methylation microarray experiments can lead to spurious results if they are not properly handled when samples are plated.

Methods: Two pilot studies examining the association of genome-wide DNA methylation patterns with obesity in Samoan men were investigated for chip- and row-specific batch effects. In each study, DNA from 46 obese men and 46 lean men was assayed using Illumina's Infinium HumanMethylation450 BeadChip. In the first study (Sample One), samples from obese and lean subjects were run on separate chips. In the second study (Sample Two), samples were balanced across chips by lean/obese status, age group, and census region. We analyzed the data with the methylumi, wateRmelon, and limma R packages, as well as ComBat. Principal component analysis was used to identify the top principal components, and linear regression was used to test their association with the batches and with lean/obese status. To identify differentially methylated positions (DMPs) between obese and lean men, we applied a moderated t-test at each locus.

Results: Chip effects were effectively removed from Sample Two but not from Sample One. In addition, the two sets of DMP results differed dramatically. After “removing” batch effects with ComBat, Sample One had 94,191 probes differentially methylated at a q-value threshold of 0.05, while Sample Two had none. The disparate results from Sample One and Sample Two likely arise from the confounding of lean/obese status with chip and row batch effects in Sample One.

Conclusion: Even the best possible statistical adjustments for batch effects may not completely remove them. Proper study design is vital for guarding against spurious findings due to such effects.
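The balanced plating described for Sample Two can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `assign_to_chips`, the synthetic cohort, the two age-group labels, and the choice of 8 chips are all assumptions made for illustration. The idea is to group samples by stratum, shuffle within each stratum, and deal them to chips in turn so that no chip is dominated by one group.

```python
import random
from collections import defaultdict


def assign_to_chips(samples, strata_keys, n_chips, seed=0):
    """Spread every stratum evenly across chips (hypothetical helper)."""
    rng = random.Random(seed)

    # Group samples by their combination of stratification variables.
    strata = defaultdict(list)
    for sample in samples:
        strata[tuple(sample[k] for k in strata_keys)].append(sample)

    chips = [[] for _ in range(n_chips)]
    next_chip = 0
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)  # random order within the stratum
        for sample in members:
            chips[next_chip].append(sample)
            # Keep dealing where the last stratum stopped, so chip sizes
            # never differ by more than one sample.
            next_chip = (next_chip + 1) % n_chips

    # Randomize the order on each chip so that row position is not tied
    # to the order in which strata were dealt.
    for chip in chips:
        rng.shuffle(chip)
    return chips


# Synthetic cohort (assumed for illustration): 46 obese and 46 lean men,
# each labelled with one of two made-up age groups.
samples = [
    {
        "id": f"S{i:02d}",
        "status": "obese" if i < 46 else "lean",
        "age_group": ("younger", "older")[i % 2],
    }
    for i in range(92)
]

chips = assign_to_chips(samples, ["status", "age_group"], n_chips=8)
for number, chip in enumerate(chips, start=1):
    n_obese = sum(s["status"] == "obese" for s in chip)
    print(f"chip {number}: {len(chip)} samples, {n_obese} obese")
```

With a layout like this, lean/obese status is nearly orthogonal to chip, so a chip effect cannot masquerade as an obesity effect. In the Sample One layout, where each chip holds only obese or only lean samples, the two are completely confounded.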
