Segmenting Eukaryotic Genomes with the Generalized Gibbs Sampler

Eukaryotic genomes display segmental patterns of variation in various properties, including GC content and degree of evolutionary conservation. DNA segmentation algorithms aim to identify statistically significant boundaries between such segments, and may provide a means of discovering new classes of functional elements in eukaryotic genomes. This paper presents a model and an algorithm for Bayesian DNA segmentation and considers the feasibility of using it to segment whole eukaryotic genomes. The algorithm is tested on a range of simulated and real DNA sequences, and the following conclusions are drawn. Firstly, the algorithm correctly identifies non-segmented sequence, and can thus be used to test the null hypothesis of uniformity in the property of interest. Secondly, the estimates of the number and locations of change-points produced by the algorithm are robust to variations in algorithm parameters and starting conditions, and correspond to real features in the data. Thirdly, the algorithm is successfully used to segment human chromosome 1 according to GC content, demonstrating the feasibility of Bayesian segmentation of eukaryotic genomes. The software described in this paper is available from the author's website (www.uq.edu.au/~uqjkeith/) or upon request to the author.
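To make the idea of a Bayesian change-point in GC content concrete, the following is a minimal sketch and not the algorithm of this paper: it assumes at most one change-point, treats each base as a GC/non-GC indicator, and places an independent Beta(1, 1) prior on the GC fraction of each segment, so the posterior can be computed exactly without any sampling. The function names and the `prior_change` parameter are illustrative assumptions only; the paper's method handles an unknown number of change-points with the generalized Gibbs sampler.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(gc, n, alpha=1.0, beta=1.0):
    """Log marginal likelihood of a segment with `gc` G/C bases out of `n`,
    integrating the GC fraction over a Beta(alpha, beta) prior (assumed)."""
    return log_beta(alpha + gc, beta + n - gc) - log_beta(alpha, beta)


def single_changepoint_posterior(seq, prior_change=0.5):
    """Return (p_null, position_posterior) for a DNA string `seq`.

    p_null is the posterior probability that the sequence is uniform
    (no change-point). position_posterior[k - 1] is the posterior
    probability of a change-point between bases k and k + 1 (1-based).
    `prior_change` is an assumed prior probability that a change-point exists.
    """
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    n = len(is_gc)
    if n < 2:
        raise ValueError("need at least two bases")
    total = sum(is_gc)

    # Hypothesis 0: one segment covering the whole sequence.
    log_null = math.log(1.0 - prior_change) + log_marginal(total, n)

    # Hypothesis k: two segments split after base k, uniform prior over k.
    log_prior_pos = math.log(prior_change) - math.log(n - 1)
    log_terms = []
    left = 0
    for k in range(1, n):
        left += is_gc[k - 1]
        log_terms.append(
            log_prior_pos
            + log_marginal(left, k)
            + log_marginal(total - left, n - k)
        )

    # Normalise with the log-sum-exp trick for numerical stability.
    m = max(max(log_terms), log_null)
    log_norm = m + math.log(
        math.exp(log_null - m) + sum(math.exp(t - m) for t in log_terms)
    )
    p_null = math.exp(log_null - log_norm)
    position_posterior = [math.exp(t - log_norm) for t in log_terms]
    return p_null, position_posterior


if __name__ == "__main__":
    # Toy input: an AT-only block followed by a GC-only block.
    p_null, post = single_changepoint_posterior("AT" * 100 + "GC" * 100)
    best = max(range(len(post)), key=post.__getitem__) + 1
    print("P(no change-point) =", p_null)
    print("most probable change-point after base", best)
```

On a compositionally uniform input the value of `p_null` stays high, which mirrors in a very simplified form the abstract's point that a segmentation method should recognise non-segmented sequence.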
