Identification of coding and non-coding sequences using local Hölder exponent formalism

MOTIVATION: Accurate prediction of genes in genomes has long been a challenging task for bioinformaticians and computational biologists. The discovery of distinct scaling relations in coding and non-coding sequences has opened new perspectives on the structure of DNA sequences. This motivated us to exploit the differences in local singularity distributions to characterize and classify coding and non-coding sequences.

RESULTS: The local singularity density distribution in the coding and non-coding sequences of four genomes was first estimated using the wavelet transform modulus maxima methodology. A support vector machine classifier was then trained on the extracted features. The trained classifier achieved an average test accuracy of 97.7%. Local singularity features of a DNA sequence can therefore be exploited to identify coding and non-coding sequences.

CONTACT: Available on request from bd.kulkarni@ncl.res.in.
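The feature-extraction idea described in RESULTS can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a purine/pyrimidine DNA walk as the numeric encoding and a Mexican hat analysing wavelet, and it replaces the full wavelet transform modulus maxima procedure (which follows maxima lines across scales) with a simpler per-position log-log slope. All function names, the scale set, and the histogram range are hypothetical choices made for illustration.

```python
import math

def dna_walk(seq):
    """Cumulative walk: +1 for purines (A/G), -1 for pyrimidines (C/T). Assumed encoding."""
    walk, total = [], 0
    for base in seq.upper():
        total += 1 if base in "AG" else -1
        walk.append(total)
    return walk

def mexican_hat(t):
    """Second derivative of a Gaussian (up to sign and constant factor)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

def cwt_row(signal, scale):
    """|W(scale, b)| at every position b, with 1/scale normalisation so |W| ~ scale**h."""
    n = len(signal)
    half = int(4 * scale)  # truncate the wavelet support at 4 scale units
    out = []
    for b in range(n):
        acc = 0.0
        for k in range(-half, half + 1):
            j = min(max(b + k, 0), n - 1)  # clamp indices at the sequence edges
            acc += signal[j] * mexican_hat(k / scale)
        out.append(abs(acc) / scale)
    return out

def local_holder(signal, scales):
    """Least-squares slope of log|W| against log(scale) at each position.

    Simplification: a full WTMM analysis would estimate the slope along maxima lines.
    """
    rows = [cwt_row(signal, s) for s in scales]
    xs = [math.log(s) for s in scales]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    eps = 1e-12  # avoids log(0) where a coefficient vanishes
    exponents = []
    for b in range(len(signal)):
        ys = [math.log(row[b] + eps) for row in rows]
        y_mean = sum(ys) / len(ys)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        exponents.append(sxy / sxx)
    return exponents

def histogram_features(values, lo=-1.0, hi=2.0, bins=10):
    """Normalised histogram of the exponents, usable as a fixed-length feature vector."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)  # out-of-range values go to end bins
        counts[idx] += 1
    return [c / len(values) for c in counts]

if __name__ == "__main__":
    seq = "ATGGCGTACGTTAGC" * 4  # toy sequence, not real genomic data
    h = local_holder(dna_walk(seq), scales=[2, 4, 8])
    print(histogram_features(h))
```

In a pipeline of this kind, one such feature vector would be computed per labelled coding or non-coding sequence and passed to a support vector machine classifier for training. The classifier itself is omitted here because the text does not specify the kernel or software used.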
