Mathematical characterization of Chaos Game Representation. New algorithms for nucleotide sequence analysis.

Chaos Game Representation (CGR) can reveal patterns in the nucleotide sequences of a class of genes, as obtained from databases, by applying fractal techniques to DNA sequences treated as strings over four units: G, A, T and C. Until now, such pattern recognition has relied solely on visual identification, and no mathematical characterization of CGR has been known. The present report describes two algorithms that can predict the presence or absence of a given stretch of nucleotides in any gene family. The first algorithm generates the DNA sequences represented by any point in the CGR. The second algorithm simulates known CGR patterns for different gene families by setting the probabilities of occurrence of different di- or trinucleotides through trial and error, guided by approximate rules of thumb. The validity of the second algorithm has been tested by simulating sequences that mimic the CGRs of vertebrate non-oncogenes, proto-oncogenes and oncogenes. These algorithms can provide a mathematical basis for the CGR patterns obtained from nucleotide sequences in databases.
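
The two ideas summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made for illustration only: the corner layout (A, C, G, T at the corners of the unit square, a common convention), the starting point at the centre of the square, the use of a first-order Markov chain to stand in for "setting the probabilities of dinucleotides", and the function names cgr_point, decode_point and simulate. Exact fractions are used so that inverting the halving map involves no rounding.

```python
import random
from fractions import Fraction

# Assumed corner layout; the abstract does not specify one.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
HALF = Fraction(1, 2)


def cgr_point(seq):
    """Play the chaos game: start at the centre and move halfway
    towards the corner of each successive base."""
    x, y = HALF, HALF
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def decode_point(x, y, k):
    """Recover the last k bases encoded by a CGR point.

    The quadrant containing the point names the most recent base;
    doubling the point away from that corner undoes one step.
    """
    bases = []
    for _ in range(k):
        cx = 1 if x > HALF else 0
        cy = 1 if y > HALF else 0
        bases.append(next(b for b, c in CORNERS.items() if c == (cx, cy)))
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(bases))


def simulate(transition, length, rng):
    """Draw a sequence from a first-order Markov chain (an assumed
    stand-in for tuning dinucleotide probabilities).

    transition[prev][nxt] is the probability that nxt follows prev.
    """
    if length <= 0:
        return ""
    seq = [rng.choice("ACGT")]
    while len(seq) < length:
        probs = transition[seq[-1]]
        weights = [probs[b] for b in "ACGT"]
        seq.append(rng.choices("ACGT", weights=weights)[0])
    return "".join(seq)


if __name__ == "__main__":
    point = cgr_point("GATTACA")
    print(point, decode_point(*point, 7))

    # Uniform transitions, except that G never follows C. The depleted
    # dinucleotide leaves its sub-square of the CGR empty.
    table = {p: {n: 0.25 for n in "ACGT"} for p in "ACGT"}
    table["C"] = {"A": 1 / 3, "C": 1 / 3, "G": 0.0, "T": 1 / 3}
    print(simulate(table, 60, random.Random(0)))
```

In this sketch a point at depth k of the halving map identifies a unique k-nucleotide suffix, which is why an empty region of the plot corresponds to an absent stretch of nucleotides. Extending the transition table to condition on the two preceding bases would give the trinucleotide case.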