Application of chaos game representation method to visualize genome structure

The recent availability of long and even complete genomic sequences opens a new field of research devoted to the general analysis of their global structure, without regard to gene interpretation. The exploration of such huge sequences (up to several megabases) needs new kinds of data representation that allow immediate visual interpretation of genomic structure and give insight into the underlying mechanisms ruling it. Our approach takes advantage of the CGR (Chaos Game Representation) to create images of large genomic sequences. The CGR method, modified here to allow for quantification, is an algorithm that produces pictures displaying the frequencies of words (short sequences of the four nucleotides G, A, T and C) and revealing nested patterns in DNA sequences. It has proved to be a quick and robust method for extracting information from long DNA sequences, allowing comparison of sequences and detection of anomalies in word frequencies. Each species seems to be associated with a specific CGR image, which can therefore be considered a genomic signature.
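As an illustration of the idea, the following is a minimal Python sketch of a chaos game representation and of a quantified word-frequency grid. It is not the authors' implementation: the corner layout in `CORNERS` is one common convention chosen here as an assumption (the abstract does not specify one), and the function names `cgr_points`, `word_cell` and `cgr_frequency_grid` are hypothetical. The sketch relies on the property that, after k steps, a CGR point lies in the cell of a 2^k by 2^k grid determined by the last k nucleotides, so each cell can hold the frequency of one word of length k.

```python
from collections import Counter

# Assumed corner layout (x, y) for the unit square; other conventions exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_points(seq):
    """Classic chaos game: start at the centre and move halfway
    towards the corner of each successive nucleotide."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # simplification: ambiguous symbols such as N are skipped
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def word_cell(word):
    """Return (row, col) of the grid cell that holds `word`.
    The last nucleotide selects the largest quadrant, so the first
    nucleotide contributes the least significant bit."""
    row = col = 0
    for i, base in enumerate(word):
        cx, cy = CORNERS[base]
        col |= cx << i
        row |= cy << i
    return row, col


def cgr_frequency_grid(seq, k):
    """Quantified CGR: a 2**k x 2**k grid of relative frequencies
    of all words of length k (words with ambiguous symbols are ignored)."""
    seq = seq.upper()
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in words if all(b in CORNERS for b in w))
    total = sum(counts.values())
    for word, n in counts.items():
        row, col = word_cell(word)
        grid[row][col] = n / total
    return grid


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genomic data.
    for row in cgr_frequency_grid("ACGTACGTTTGA", 2):
        print(row)
```

Counting words directly, rather than binning floating-point points, avoids rounding problems on very long runs of a single nucleotide. Rendering the grid as a grey-scale image, or comparing two grids cell by cell, would be natural next steps for viewing or comparing signatures.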
