Gene-set analysis is severely biased when applied to genome-wide methylation data

MOTIVATION: DNA methylation is an epigenetic mark that can stably repress gene expression. Because of its biological and clinical significance, several methods have been developed to compare genome-wide patterns of methylation between groups of samples. The application of gene-set analysis to identify relevant groups of genes that are enriched for differentially methylated genes is often a major component of the analysis of these data. This can be used, for example, to identify processes or pathways that are perturbed in disease development. We show that gene-set analysis, as it is typically applied to genome-wide methylation assays, is severely biased as a result of differences in the numbers of CpG sites associated with different classes of genes and gene promoters.

RESULTS: We demonstrate this bias using published data from a study of differential CpG island methylation in lung cancer and a dataset we generated to study methylation changes in patients with long-standing ulcerative colitis. We show that several of the gene sets that seem enriched would also be identified with randomized data. We suggest two existing approaches that can be adapted to correct the bias. Accounting for the bias in the lung cancer and ulcerative colitis datasets provides novel biological insights into the role of methylation in cancer development and chronic inflammation, respectively. Our results have significant implications for many previous genome-wide methylation studies that have drawn conclusions on the basis of such strongly biased analysis.

CONTACT: cathal.seoighe@nuigalway.ie

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
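The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' analysis: every number in it (gene counts, CpG sites per gene, the fraction of sites called differentially methylated) is an arbitrary assumption, and the helper names `hypergeom_sf` and `simulate` are hypothetical. CpG sites are picked purely at random, a gene is called differentially methylated if any of its sites is picked, and a standard hypergeometric enrichment test is then applied to a gene set whose members simply carry more CpG sites than other genes.

```python
import math
import random


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M genes, n of which are in the set."""
    total = math.comb(M, N)
    tail = sum(math.comb(n, i) * math.comb(M - n, N - i)
               for i in range(k, min(n, N) + 1))
    return tail / total


def simulate(seed=0, n_set=100, n_other=1900,
             cpgs_set=20, cpgs_other=2, frac_sites=0.05):
    """Null simulation: sites are chosen at random, with no true biology.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    # Genes 0..n_set-1 form the gene set and carry more CpG sites each.
    cpgs_per_gene = [cpgs_set] * n_set + [cpgs_other] * n_other
    site_to_gene = [g for g, c in enumerate(cpgs_per_gene) for _ in range(c)]

    # Randomly label a fixed fraction of sites as "differentially methylated".
    n_hits = int(frac_sites * len(site_to_gene))
    hit_sites = rng.sample(range(len(site_to_gene)), n_hits)

    # Gene-level call: a gene is DM if at least one of its sites was hit.
    dm_genes = {site_to_gene[s] for s in hit_sites}
    dm_in_set = sum(1 for g in dm_genes if g < n_set)

    # Naive enrichment test that ignores the number of sites per gene.
    p_value = hypergeom_sf(dm_in_set, n_set + n_other, n_set, len(dm_genes))
    return dm_in_set, len(dm_genes), p_value


if __name__ == "__main__":
    dm_in_set, n_dm, p_value = simulate()
    print(f"DM genes in set: {dm_in_set} of 100; DM genes overall: {n_dm} of 2000")
    print(f"naive hypergeometric p-value: {p_value:.3g}")
```

Under these assumptions the gene set appears strongly enriched even though the site labels are random, because genes with more CpG sites have more chances of being called. A corrected test would need a null distribution that accounts for the number of sites per gene.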
