A fractal method to distinguish coding and non-coding sequences in a complete genome based on a number sequence representation.

We propose a fractal method to distinguish coding from non-coding sequences in a complete genome, based on the different statistical behavior of the two kinds of sequences. We first introduce a number sequence representation of DNA sequences and then perform multifractal analysis on the measure representation of the resulting number sequence. Three exponents, C(-1), C1 and C2, are selected from the results of the multifractal analysis, so that each DNA sequence is represented by a point (C(-1), C1, C2) in three-dimensional space. For the complete genomes of many prokaryotes, the points corresponding to coding and non-coding sequences are roughly distributed in different regions, and Fisher's discriminant algorithm can be used to separate these two regions. If the point (C(-1), C1, C2) for a DNA sequence lies in the region corresponding to coding sequences, the sequence is classified as coding; otherwise, it is classified as non-coding. For all 51 prokaryotes we considered, the average discriminant accuracies pc, pnc, qc and qnc reach 72.28%, 84.65%, 72.53% and 84.18%, respectively.
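
The separation step can be illustrated with a minimal sketch of Fisher's linear discriminant on three-component points. Everything specific here is an assumption made for illustration: the points are synthetic Gaussian samples standing in for (C(-1), C1, C2) vectors, not values computed from any genome, the function names fisher_direction and is_coding are hypothetical, and the decision threshold is placed at the midpoint of the two projected class means, which may differ from the rule used in the paper.

```python
import numpy as np


def fisher_direction(X_c, X_nc):
    """Fit Fisher's linear discriminant for two classes of 3-D points.

    X_c, X_nc: arrays of shape (n, 3), one row per sequence.
    Returns the projection direction w and a scalar threshold.
    """
    m_c, m_nc = X_c.mean(axis=0), X_nc.mean(axis=0)
    # Within-class scatter matrix: sum of the two per-class scatters.
    S_w = (X_c - m_c).T @ (X_c - m_c) + (X_nc - m_nc).T @ (X_nc - m_nc)
    # Fisher direction w = S_w^{-1} (m_c - m_nc).
    w = np.linalg.solve(S_w, m_c - m_nc)
    # Assumption: threshold at the midpoint of the projected class means.
    threshold = 0.5 * (w @ m_c + w @ m_nc)
    return w, threshold


def is_coding(point, w, threshold):
    """Classify a point as coding if its projection exceeds the threshold."""
    return float(np.dot(w, point)) > threshold


# Synthetic stand-ins for the two groups of points (illustrative only).
rng = np.random.default_rng(0)
X_c = rng.normal([1.0, 0.5, 0.3], 0.1, size=(200, 3))
X_nc = rng.normal([0.4, 0.2, 0.1], 0.1, size=(200, 3))

w, threshold = fisher_direction(X_c, X_nc)

# Fraction of each synthetic class that falls on its own side.
frac_c = np.mean([is_coding(x, w, threshold) for x in X_c])
frac_nc = np.mean([not is_coding(x, w, threshold) for x in X_nc])
print(f"coding correctly classified: {frac_c:.2%}")
print(f"non-coding correctly classified: {frac_nc:.2%}")
```

Because w points from the non-coding mean toward the coding mean after rescaling by the within-class scatter, projections above the threshold fall on the coding side. On real data the two clouds overlap far more than in this synthetic example, which is consistent with the accuracies reported above being well below 100%.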
