Historical Linguistics and Evolutionary Genetics. Based on Symbol Frequencies in Tamil Texts and DNA Sequences

Abstract. We have studied the rank frequency distribution (RFD) of letters of the alphabet in Tamil language texts. In a novel application of rank frequencies, we define a simple, intuitive distance parameter between a pair of strings (texts, or DNA sequences treated as strings of codons). This distance correlates well with age difference in both historical linguistics and evolutionary genetics. From the distance matrix of a set of strings, we derive evolutionary trees that are broadly in agreement with historical evidence. The method has potential for refinement and could complement other approaches in evolutionary studies. The RFD of a single string conforms to the Cumulative Modified Power Law (CMPL), which we formulated earlier and have applied to the RFDs of diverse symbol sets.
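The abstract does not state how the distance parameter is defined, so the following is only a minimal sketch of the general idea: compute the rank frequency distribution of each string, then compare the ranks of shared symbols. The distance used here (mean absolute rank difference) is an assumption chosen for illustration and is not the authors' definition; the function names `rank_frequency`, `rank_distance`, and `distance_matrix` are likewise hypothetical. Inputs are sequences of symbols, so a text can be passed as a string of letters and a DNA sequence as a list of codons.

```python
from collections import Counter


def rank_frequency(symbols):
    """Return (symbol, relative frequency) pairs sorted by descending frequency."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # Sort by count (descending), then by symbol, so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(sym, n / total) for sym, n in ordered]


def _ranks(symbols, alphabet):
    """Map every symbol in `alphabet` to its 1-based rank in `symbols`.

    Symbols that do not occur are all assigned the last possible rank
    (an assumption made for this sketch).
    """
    rank_of = {sym: i + 1 for i, (sym, _) in enumerate(rank_frequency(symbols))}
    last = len(alphabet)
    return {sym: rank_of.get(sym, last) for sym in alphabet}


def rank_distance(a, b):
    """Illustrative distance: mean absolute rank difference over the joint alphabet.

    This is NOT the distance defined in the paper, only a stand-in.
    """
    alphabet = sorted(set(a) | set(b))
    ra, rb = _ranks(a, alphabet), _ranks(b, alphabet)
    return sum(abs(ra[s] - rb[s]) for s in alphabet) / len(alphabet)


def distance_matrix(strings):
    """Pairwise distances for a list of symbol sequences."""
    n = len(strings)
    return [[rank_distance(strings[i], strings[j]) for j in range(n)]
            for i in range(n)]
```

A matrix produced this way could then be passed to any standard distance-based tree-building procedure; which procedure the authors used is not stated in the abstract.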
