No signs of hidden language in noncoding DNA.

Recent comparisons between the statistical properties of coding and noncoding DNA sequences have been interpreted as indicating a yet-undiscovered language in noncoding DNA [1]. We argue that greater variance among nucleotide frequencies in noncoding regions explains most of the observations, which undercuts the claims in [1].

DNA sequences are long strings composed of four nucleotides (A, C, G, and T). For a statistical analysis, these strings may be split into "words" of fixed length $n$. Then the word frequencies, $p_i$, are computed. In [1] the Shannon redundancy $R(n)$, $R(n) = 1 + \sum_{i=1}^{4^n} p_i \log_2 p_i / (2n)$, of noncoding DNA was shown to be nonzero (as in natural languages) and significantly larger than that of coding DNA. For $n = 1$, however, this simply reflects that nucleotide frequencies are more unequal in noncoding than in coding DNA; $R(1)$ increases as the variance of the $p_i$ distribution increases. The increase in $R(n)$ as $n$ increases is the same for coding and noncoding DNA (see Fig. 3 [1]) and thus does not distinguish between them. Furthermore, it can be shown that correlations of finite range $r$ imply an increasing $R(n)$ even for $n > r$. Such local correlations may be caused by simple mutation processes or could originate from previously coding parts in noncoding DNA [2]. In short, the systematically higher values of $R(n)$ for noncoding than for coding DNA, which [1] argues to be suggestive of hidden language, arise simply because the noncoding DNA has greater variance in its $p_i$ distribution than does coding DNA.

In a "Zipf analysis" all possible $4^n$ words are ranked according to their frequencies, $p_i$, from most to least frequent. Power-law behavior was noted in [1], visible as a linear region in a double-logarithmic plot (see Fig. …). The slope for noncoding DNA was found to be larger than that for coding DNA, and close to that of English text, also analyzed with Zipf's method and fixed word length. Th