Universal sequence map (USM) of arbitrary discrete sequences

Background: For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but has not been fully realized. The basic idea is that any sequence of symbols may define a trajectory in a continuous space that conserves all of the sequence's statistical properties. Ideally, such a representation would allow scale-independent sequence analysis, that is, analysis that does not depend on a fixed memory length. A simple example would be the ability to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.

Results: We have identified an iterative function for the bijective mapping ψ of discrete sequences into objects in a continuous state space that enable scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences of arbitrary length with an arbitrary number of unique units, and it generates a representation in which map distance estimates sequence similarity. The USM procedure builds on earlier work by these and other authors on the properties of Chaos Game Representation (CGR), which enables the representation of sequences with four unit types (such as DNA) as an order-free Markov chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: http://bioinformatics.musc.edu/~jonas/usm/.

Conclusions: USM is shown to enable a statistical mechanics approach to sequence analysis. The scale-independent representation frees sequence analysis from the need to assume a memory length when investigating syntactic rules.
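To make the idea of a CGR-style iterated map over an arbitrary alphabet concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes details the abstract does not state: each unit is assigned a binary corner of a unit hypercube with just enough dimensions to give every unit its own corner, every coordinate starts at 0.5, and each step moves the current point halfway toward the corner of the next unit. The names `usm_forward` and `shared_suffix_estimate` are hypothetical, and the similarity measure shown (negative base-2 logarithm of the largest coordinate difference) is one plausible way to turn map distance into an estimate of the number of trailing units two positions share.

```python
import math


def usm_forward(seq, alphabet=None, start=0.5):
    """Map each position of seq to a point in a unit hypercube (illustrative sketch)."""
    symbols = sorted(set(seq)) if alphabet is None else list(alphabet)
    # Assumption: use the smallest number of binary dimensions that gives
    # every unit a distinct hypercube corner (at least one dimension).
    n_bits = max(1, (len(symbols) - 1).bit_length())
    corner = {
        s: tuple((i >> b) & 1 for b in range(n_bits))
        for i, s in enumerate(symbols)
    }
    pos = [start] * n_bits  # assumption: start at the centre of the hypercube
    coords = []
    for s in seq:
        # CGR-style step: move halfway from the current point to the unit's corner.
        pos = [(p + c) / 2 for p, c in zip(pos, corner[s])]
        coords.append(tuple(pos))
    return coords


def shared_suffix_estimate(a, b):
    """Estimate how many trailing units two map positions have in common."""
    d = max(abs(x - y) for x, y in zip(a, b))
    return math.inf if d == 0 else -math.log2(d)


if __name__ == "__main__":
    first = usm_forward("AACG", alphabet="ACGT")
    second = usm_forward("TTCG", alphabet="ACGT")
    # The two sequences end in the same two units ("CG").
    print(shared_suffix_estimate(first[-1], second[-1]))
```

Under these assumptions, each shared unit halves the coordinate difference left by the last mismatch, so the estimate is at least the number of shared trailing units and can overshoot when earlier context happens to be similar. No memory length is fixed in advance, which is the scale-independence the abstract describes.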
