Middle-range clustering of nucleotides in genomes

We propose a novel, transparent, and very simple algorithm for analyzing middle-range correlations in genomic nucleotide sequences. Applied to the EMBL Nucleotide Sequence Database, the algorithm shows that each of the four nucleotides clusters in eukaryotic genomic sequences on a scale of several hundred base pairs. In prokaryotes, the clustering is weak but still evident. Within the clusters of a given nucleotide, the other three bases are deficient: A is the most deficient nucleotide in clusters of C, and vice versa, and G is the most deficient nucleotide in clusters of T, and vice versa. The algorithm also detects CG islands extending over 1 kb in vertebrate sequences. In plants, CG islands are shown to be much smaller, if they exist at all. The TA doublet also shows a clustering tendency; other doublets do not cluster. We observe no strong correlation between nucleotides separated in genomes by more than 1 kb.
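The abstract does not spell out the algorithm, so the following is only a minimal sketch of one simple way to quantify clustering at a given separation, not the authors' method. It assumes a hypothetical measure, `pair_enrichment`, defined as the observed count of base `y` exactly `d` positions downstream of base `x`, divided by the count expected if positions were independent. Values above 1 for `x == y` at separations of tens to hundreds of bases would indicate clustering of that base; values below 1 for `x != y` would indicate a deficiency of `y` near `x`.

```python
from collections import Counter


def pair_enrichment(seq: str, x: str, y: str, d: int) -> float:
    """Observed / expected count of base y exactly d positions after base x.

    Hypothetical illustration only: the expected count assumes that the two
    positions are statistically independent and uses whole-sequence base
    frequencies.
    """
    seq = seq.upper()
    n = len(seq)
    if d <= 0 or d >= n:
        raise ValueError("d must satisfy 0 < d < len(seq)")

    freq = Counter(seq)
    pairs = n - d  # number of position pairs (i, i + d)

    # Count pairs where x sits at i and y sits at i + d.
    observed = sum(
        1 for i in range(pairs) if seq[i] == x and seq[i + d] == y
    )
    # Expected count under independence of the two positions.
    expected = pairs * (freq[x] / n) * (freq[y] / n)
    return observed / expected if expected > 0 else float("nan")


if __name__ == "__main__":
    # Toy sequence with one block of A followed by one block of C.
    toy = "A" * 50 + "C" * 50
    for d in (1, 5, 20):
        same = pair_enrichment(toy, "A", "A", d)
        cross = pair_enrichment(toy, "A", "C", d)
        print(f"d={d:>2}  A..A ratio={same:.2f}  A..C ratio={cross:.2f}")
```

On real data one would scan `d` over a range of separations and look at where the ratio returns to 1; the choice of normalization (whole-sequence frequencies here) is an assumption and affects the result for sequences with strong compositional heterogeneity.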
