Determining Coding CpG Islands by Identifying Regions Significant for Pattern Statistics on Markov Chains

Statistical Applications in Genetics and Molecular Biology

Recent experimental and computational work confirms that CpGs can be unmethylated inside coding exons, showing that codons may be subject to both genomic and epigenomic constraints. It is therefore of interest to identify coding CpG islands (CCGIs), which are regions inside exons that are enriched for CpGs. The difficulty in identifying such islands is that coding exons exhibit sequence biases, determined by codon usage and constraints, that must be taken into account. We present a method for finding CCGIs that showcases a novel approach we have developed for identifying regions of interest that are significant (with respect to a Markov chain) for the counts of any pattern. Our method begins with the exact computation of tail probabilities for the number of CpGs in all regions contained in coding exons, and then applies a greedy algorithm to select islands from among those regions. We show that the greedy algorithm provably optimizes a biologically motivated criterion for selecting islands while controlling the false discovery rate. We applied this approach to the human genome (hg18) and annotated CpG islands in coding exons. The statistical criterion we apply in evaluating islands reduces the number of false positives in existing annotations, while our approach to defining islands reveals significant numbers of previously undiscovered CCGIs in coding exons. Many of these appear to be examples of functional epigenetic specialization in coding exons.
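To make the first step concrete, the following is a minimal sketch of an exact tail probability for the CpG count in a single region. It assumes a homogeneous first-order Markov chain over the four nucleotides, which is a simplification: the abstract does not specify the chain used for coding exons, and a model that reflects codon usage would likely be more elaborate. The function name `cpg_tail_probability` and the parameters `init` and `trans` are hypothetical and chosen only for illustration. The greedy island selection and the false discovery rate control are not shown.

```python
from typing import Dict, List

ALPHABET = "ACGT"


def cpg_tail_probability(
    n: int,
    k: int,
    init: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
) -> float:
    """Return P(number of CG dinucleotides >= k) for a length-n sequence.

    Assumption (not from the source): the sequence follows a homogeneous
    first-order Markov chain with initial distribution `init` and
    transition probabilities `trans[a][b]` = P(next = b | current = a).
    """
    if k <= 0:
        return 1.0
    if n < 2:
        return 0.0  # a CG dinucleotide needs at least two positions

    # dist[a][c] = P(current letter is a, CpG count so far is c).
    # Counts are capped at k, so index k means "k or more".
    dist: Dict[str, List[float]] = {a: [0.0] * (k + 1) for a in ALPHABET}
    for a in ALPHABET:
        dist[a][0] = init[a]

    for _ in range(n - 1):
        new: Dict[str, List[float]] = {a: [0.0] * (k + 1) for a in ALPHABET}
        for a in ALPHABET:
            for c, p in enumerate(dist[a]):
                if p == 0.0:
                    continue
                for b in ALPHABET:
                    # A transition C -> G completes one CpG occurrence.
                    is_cpg = a == "C" and b == "G"
                    c_next = min(k, c + 1) if is_cpg else c
                    new[b][c_next] += p * trans[a][b]
        dist = new

    return sum(dist[a][k] for a in ALPHABET)
```

Under these assumptions the recursion costs on the order of n times k times 16 operations per region, and the result is exact up to floating-point rounding rather than an asymptotic approximation. Applying it to every region of every coding exon, as the abstract describes, would require reusing intermediate results across overlapping regions, which this sketch does not attempt.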
