Algorithm for Coding DNA Sequences into "Spectrum-like" and "Zigzag" Representations

An algorithm is suggested for encoding long strings of building blocks, such as the 4 DNA bases (adenine-A, cytosine-C, thymine-T, and guanine-G), the 20 natural amino acids (from alanine-Ala to valine-Val, plus the stop triplet), or all 64 possible base triplets (from AAA to TTT), into "zigzag" or "spectrum-like" representations. The new encoding scheme can be derived in 3-, 2-, or 1-dimensional form, depending on the user's wishes. The only information required, besides the string for which the "spectrum-like" representation is sought, is the initial positioning of the complete set of units from which the string is composed, i.e., four positions for A, C, G, and T, or 20 positions for natural amino acids plus stop, etc. This initial positioning can be specified in 3-, 2-, or 1-D form. As an illustration of the suggested encoding scheme, a visual and chemometric comparison is shown of the first 10 exon strings of the beta globin gene of 10 different species, each string being about 100 amino acids long.
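The abstract names only the two inputs to the scheme (the string and the initial positioning of its units) and does not state the rule that turns them into a curve. The following is therefore a minimal sketch under stated assumptions, not the authors' algorithm: it assumes the 1-D form, a hypothetical positioning `INITIAL_POSITIONS` of the four bases at levels 1 to 4, and the simplest possible rule, in which the k-th unit of the string is drawn at the point (k, level of that unit). Joining consecutive points gives a zigzag polyline that can be read like a spectrum. The function name `zigzag_1d` is likewise illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical initial 1-D positioning of the four DNA bases.
# The levels are an assumption made for this sketch, not values from the paper.
INITIAL_POSITIONS: Dict[str, float] = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def zigzag_1d(
    sequence: str,
    positions: Dict[str, float] = INITIAL_POSITIONS,
) -> List[Tuple[int, float]]:
    """Map a string of units to (index, level) points.

    Assumed rule: the k-th unit (1-based) is placed at x = k and
    y = the initial position assigned to that unit.
    """
    points: List[Tuple[int, float]] = []
    for k, unit in enumerate(sequence.upper(), start=1):
        if unit not in positions:
            raise ValueError(f"no initial position defined for unit {unit!r}")
        points.append((k, positions[unit]))
    return points


if __name__ == "__main__":
    # Print the polyline vertices for a short example string.
    for x, y in zigzag_1d("ACGTTA"):
        print(x, y)
```

The same function accepts any other alphabet, for example amino acids, as long as a positioning dictionary covering every unit is supplied; the 2-D and 3-D forms would replace each level with a coordinate pair or triple.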
