The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability.

DNA's genetic code can be represented as a sequence over the four letters A, C, G, and T, which stand for the four types of nucleotides (adenylic, cytidylic, guanylic, and thymidylic acid) of which DNA is composed. Now that these sequences have been determined for many genes and are available in computer-readable form, scientists can analyze them and search for patterns in an attempt to learn more about the regulatory functions of genes. One area of study is the frequency of occurrence of specific nucleotide subsequences (e.g., ACAC) within part or all of a nucleotide sequence. This paper derives the probability distribution of the frequency of occurrence of a subsequence within a nucleotide sequence, under the hypothesis that the four nucleotides occur at random and with equal probability. This distribution is nontrivial because different subsequences have different "overlap capability." For example, the subsequence AAAA can occur up to 17 times in a sequence of length 20 (which would happen if the sequence were composed solely of A's), because successive occurrences may overlap, whereas the subsequence ACGT cannot overlap itself and therefore cannot occur more than 5 times in a sequence of length 20. Thus, the frequency distribution differs for each type of overlap capability. It is of interest to assess and compare the degree of nonrandomness for different subsequences or among different portions of a sequence; the existence and degree of nonrandomness may be related to the type and degree of functionality of a nucleotide (sub)sequence. The frequency distributions provided here can be used to perform exact significance tests of the hypothesis of randomness. An approximate test is also described for use with long sequences; it can also be used to test a more general null hypothesis in which the nucleotides occur with unequal probabilities.
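
The counts quoted above (17 for AAAA, 5 for ACGT) and the resulting difference in distributions can be checked numerically. The following is a minimal sketch, not the derivation used in the paper: it assumes the equal-probability null hypothesis stated above and obtains the exact distribution by dynamic programming over a prefix-matching automaton. All function names (`count_overlapping`, `smallest_period`, `max_occurrences`, `occurrence_distribution`) are hypothetical and introduced only for illustration.

```python
from fractions import Fraction

ALPHABET = "ACGT"  # assumption: four nucleotides, each with probability 1/4


def count_overlapping(seq, word):
    """Count occurrences of word in seq, allowing overlapping occurrences."""
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq.startswith(word, i))


def smallest_period(word):
    """Smallest shift p > 0 at which word matches itself (len(word) if none)."""
    m = len(word)
    for p in range(1, m):
        if word[p:] == word[:m - p]:
            return p
    return m


def max_occurrences(word, n):
    """Largest possible number of occurrences of word in a sequence of length n."""
    m = len(word)
    if n < m:
        return 0
    return (n - m) // smallest_period(word) + 1


def _step(word, state, char):
    """Length of the longest prefix of word that is a suffix of word[:state] + char."""
    text = word[:state] + char
    for k in range(min(len(word), len(text)), 0, -1):
        if text.endswith(word[:k]):
            return k
    return 0


def occurrence_distribution(word, n):
    """Exact P(count = k), k = 0..max, for a uniform random sequence of length n."""
    m = len(word)
    trans = [[_step(word, s, c) for c in ALPHABET] for s in range(m + 1)]
    quarter = Fraction(1, 4)
    probs = {(0, 0): Fraction(1)}  # (automaton state, occurrences so far) -> probability
    for _ in range(n):
        nxt = {}
        for (s, k), p in probs.items():
            for t in trans[s]:
                key = (t, k + (t == m))  # reaching state m completes one occurrence
                nxt[key] = nxt.get(key, 0) + p * quarter
        probs = nxt
    dist = [Fraction(0)] * (max_occurrences(word, n) + 1)
    for (_, k), p in probs.items():
        dist[k] += p
    return dist


if __name__ == "__main__":
    for w in ("AAAA", "ACAC", "ACGT"):
        d = occurrence_distribution(w, 20)
        print(w, "max =", max_occurrences(w, 20),
              "P(count >= 1) =", float(1 - d[0]))
```

Under this sketch, AAAA and ACGT have the same expected count in a sequence of length 20 (17/256), yet their distributions differ, which is the point the abstract makes about overlap capability. An exact one-sided significance level for an observed count `k` would be `sum(d[k:])`.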
