Delineating relative homogeneous G+C domains in DNA sequences.

The concept of homogeneity of G+C content is always relative and subjective. This point is emphasized and quantified in this paper using a simple example of one sequence segmented into two subsequences. Whether the sequence is homogeneous can be decided by asking whether the two-subsequence model describes the DNA sequence better than the one-sequence model. There are at least three equivalent ways of looking at the 1-to-2 segmentation: the Jensen-Shannon divergence measure, the log likelihood ratio test, and model selection using the Bayesian information criterion. Once a criterion is chosen, a DNA sequence can be recursively segmented into multiple domains. We use one subjective criterion, called segmentation strength, based on the Bayesian information criterion. Whether a sequence is homogeneous, and how many domains it has, depends on this criterion. We compare six different genome sequences (yeast S. cerevisiae chromosomes III and IV, the bacterium M. pneumoniae, the human major histocompatibility complex sequence, and the longest contigs in human chromosomes 21 and 22) by recursive segmentation at different strength criteria. The results confirm that yeast chromosome IV is more homogeneous than yeast chromosome III, that human chromosome 21 is more homogeneous than human chromosome 22, and that bacterial genomes may not be homogeneous because of short segments with distinct base compositions. The recursive segmentation also provides a quantitative criterion for identifying isochores in human sequences. Some features of our recursive segmentation, such as the possibility of delineating domain borders accurately, are superior to those of the moving-window approach commonly used in such analyses.
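The 1-to-2 segmentation and its recursive application can be illustrated with a minimal sketch. It is not the implementation used in the paper, and several choices are assumptions made for illustration: the sequence is reduced to a two-symbol alphabet (G/C versus everything else), twice the log likelihood ratio is computed as 2N times the Jensen-Shannon divergence in nats, the BIC penalty is taken as 2 log N (two extra parameters in the two-subsequence model: one composition and one cut position), and the strength is defined as the relative excess of the statistic over that penalty. The names `entropy`, `best_cut`, `segment`, and the threshold `s0` are hypothetical.

```python
import math


def entropy(p):
    """Binary Shannon entropy in nats; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def best_cut(seq):
    """Return (cut, strength) for the best 1-to-2 split of seq (len >= 2).

    cut is None when no split improves the likelihood at all.
    """
    n = len(seq)
    # Prefix counts of G/C so each candidate cut is evaluated in O(1).
    gc_prefix = [0]
    for base in seq:
        gc_prefix.append(gc_prefix[-1] + (base in "GC"))
    total = gc_prefix[n]
    h_all = entropy(total / n)

    best_i, best_llr2 = None, 0.0
    for i in range(1, n):
        left = gc_prefix[i]
        right = total - left
        # 2 * log likelihood ratio = 2N * Jensen-Shannon divergence.
        llr2 = 2.0 * (n * h_all
                      - i * entropy(left / i)
                      - (n - i) * entropy(right / (n - i)))
        if llr2 > best_llr2:
            best_i, best_llr2 = i, llr2

    penalty = 2.0 * math.log(n)  # assumed BIC penalty for 2 extra parameters
    strength = (best_llr2 - penalty) / penalty
    return best_i, strength


def segment(seq, s0=0.0, offset=0):
    """Recursively split seq; return sorted domain borders (0-based)."""
    if len(seq) < 2:
        return []
    cut, strength = best_cut(seq)
    if cut is None or strength <= s0:
        return []  # the one-sequence model is preferred: treat as homogeneous
    return (segment(seq[:cut], s0, offset)
            + [offset + cut]
            + segment(seq[cut:], s0, offset + cut))


if __name__ == "__main__":
    print(segment("A" * 200 + "G" * 200))  # one sharp composition change
    print(segment("AG" * 100))             # uniform composition
```

Raising `s0` makes the criterion stricter, so fewer borders are reported. In this sketch the same sequence can therefore be judged homogeneous or heterogeneous depending on the threshold chosen, which is the sense in which homogeneity is relative.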
