New stopping criteria for segmenting DNA sequences.

We propose a solution to the problem of when to stop segmenting inhomogeneous DNA sequences with complex statistical patterns. The new stopping criterion is based on the Bayesian information criterion (BIC) within the model selection framework. When this criterion is applied to a telomere of S. cerevisiae and to the complete sequence of E. coli, it identifies the borders of biologically meaningful units and yields a more reasonable number of domains. We also introduce a measure called segmentation strength, which can be used to control the delineation of large domains. The relationship between the average domain size and the threshold of segmentation strength is determined for several genome sequences.
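The idea of a BIC-based stopping rule can be illustrated with a minimal sketch. The abstract does not give the formulas, so the following are assumptions made for illustration only: candidate cuts are scored by the Jensen-Shannon divergence D between the base compositions of the two sides, a cut in a segment of length N is accepted when 2·N·D exceeds a penalty of 4·ln(N) (four extra free parameters for a four-letter alphabet), and segmentation strength is the relative excess of 2·N·D over that penalty. The function names (`best_split`, `segmentation_strength`, `segment`) are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (in nats) of a symbol-count table covering n symbols."""
    return -sum((c / n) * math.log(c / n) for c in counts.values() if c > 0)


def best_split(seq):
    """Return (cut, d): the cut position with the largest Jensen-Shannon divergence d."""
    n = len(seq)
    total = Counter(seq)
    h_all = entropy(total, n)
    left = Counter()
    best_cut, best_d = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total - left  # Counter subtraction keeps only positive counts
        d = h_all - (i / n) * entropy(left, i) - ((n - i) / n) * entropy(right, n - i)
        if d > best_d:
            best_cut, best_d = i, d
    return best_cut, best_d


def segmentation_strength(n, d, extra_params=4):
    """Assumed definition: relative excess of 2*N*D over the BIC penalty."""
    penalty = extra_params * math.log(n)
    return (2 * n * d - penalty) / penalty


def segment(seq, threshold=0.0, offset=0):
    """Recursively cut seq; stop when the strength does not exceed threshold.

    threshold=0.0 corresponds to the assumed plain BIC rule; larger values
    keep only stronger borders and therefore give larger domains.
    """
    if len(seq) < 2:
        return []
    cut, d = best_split(seq)
    if cut is None or segmentation_strength(len(seq), d) <= threshold:
        return []
    return (segment(seq[:cut], threshold, offset)
            + [offset + cut]
            + segment(seq[cut:], threshold, offset + cut))


if __name__ == "__main__":
    toy = "A" * 200 + "G" * 200  # two compositionally distinct domains
    print(segment(toy))                  # border positions under the BIC rule
    print(segment(toy, threshold=30.0))  # a stricter strength threshold
```

Raising `threshold` in this sketch is one way to picture how a segmentation-strength cutoff could control the average domain size: only borders whose strength exceeds the cutoff survive. The scan recomputes entropies at every position and is meant for short toy inputs, not genome-scale sequences.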
