Genome analysis with inter-nucleotide distances

Motivation: DNA sequences can be represented as sequences of four symbols, but it is often useful to convert the symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but they seem unrelated to any intrinsic characteristic of DNA. The objective of this work was to find a mapping scheme that is directly related to DNA characteristics and that is useful for discriminating between species. Mathematical models that explore DNA correlation structures may contribute to a better understanding of DNA and help to find a concise description of it.

Results: We developed a methodology to process DNA sequences based on inter-nucleotide distances. Our main contribution is a method to obtain genomic signatures for complete genomes, based on the inter-nucleotide distances, that are able to discriminate between species. Using these signatures and hierarchical clustering, it is possible to build phylogenetic trees. These trees differentiate genomes and allow phylogenetic relations to be inferred. The phylogenetic trees generated in this work place related species close to each other, suggesting that the inter-nucleotide distances capture essential information about the genomes. To create the genomic signature, we construct a vector that describes the inter-nucleotide distance distribution of a complete genome and compare it with the reference distance distribution, which is the distribution of a sequence in which the nucleotides are placed randomly and independently. The residual, or relative error, between the data and the reference distribution is what is used to compare the DNA sequences of different organisms.

Contact: vera@ua.pt
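The signature construction described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the inter-nucleotide distance is taken as the gap between consecutive occurrences of the same nucleotide; the reference for independent random placement is the geometric distribution p(1 - p)^(k - 1), with p estimated from the nucleotide's frequency in the sequence; and distances are truncated at a hypothetical `max_dist`. The function names are illustrative only.

```python
from collections import Counter


def inter_nucleotide_distances(seq, symbol):
    """Gaps between consecutive occurrences of `symbol` in `seq`."""
    positions = [i for i, s in enumerate(seq) if s == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]


def relative_error_signature(seq, max_dist=10, alphabet="ACGT"):
    """Relative error between observed and reference distance distributions.

    Returns one value per (nucleotide, distance) pair, for distances
    1..max_dist. The truncation at max_dist is an assumption of this sketch.
    """
    seq = seq.upper()
    n = len(seq)
    signature = []
    for symbol in alphabet:
        dists = inter_nucleotide_distances(seq, symbol)
        counts = Counter(dists)
        total = len(dists)
        # Assumption: p is estimated from the nucleotide frequency.
        p = seq.count(symbol) / n if n else 0.0
        for k in range(1, max_dist + 1):
            observed = counts[k] / total if total else 0.0
            # Geometric reference: independent random placement.
            reference = p * (1 - p) ** (k - 1)
            if reference > 0:
                signature.append((observed - reference) / reference)
            else:
                signature.append(0.0)
    return signature
```

Signature vectors computed this way for several genomes could then be passed to any standard hierarchical clustering routine to obtain a tree; the choice of distance between signatures and of linkage method is left open here.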
