Languages with longer words have more lexical change

The findings presented in this paper were not anticipated; they emerged as an unexpected result of examining how a version of the Levenshtein distance applied to word lists compares with cognate counting. We were interested in the degree to which the two correlate. The results of that investigation are interesting in their own right and are presented in section 2, but even more interesting is our finding that the differences between counting cognates and measuring Levenshtein distances vary as a function of the average word length in the word lists compared. This observation occupies the remainder of the paper: section 3 establishes its statistical significance across language families, section 4 establishes its significance within language groups, and section 5 discusses competing explanations. First, we briefly explain the specific version of the Levenshtein distance used and the concept of cognate identification.

In numerous previous papers, beginning with Holman et al. (2008a), the present authors and other members of the network of scholars participating in the Automated Similarity Judgment Program (ASJP) have applied a computer-assisted comparison of word lists to derive a measure of differences among languages. The method compares pairs of words to determine the Levenshtein distance (LD), defined as the minimum number of substitutions, insertions, and deletions necessary to transform one word into another. The LD is divided by the length of the longer of the two words compared, so that every distance lies in the range 0%–100%. This normalized measure, called LDN, is averaged over all pairs of words referring to the same concept in the lists from two given languages. To enhance discrimination between related and unrelated languages, this average LDN is further divided by the average LDN between words referring to different concepts in the two lists, yielding what we call LDND ('Levenshtein Distance Normalized Divided'). A similarity measure, here called ASJPsim, is defined by subtracting LDND from 100%.
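The chain of definitions above (LD, LDN, LDND, ASJPsim) can be illustrated with a minimal Python sketch. It is not the ASJP project's own software. The function names, the dictionary-based word-list format, and the toy word lists are assumptions made for illustration. The sketch also assumes that each character stands for exactly one sound symbol, and it expresses values as fractions between 0 and 1 rather than as percentages.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LD divided by the length of the longer word (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list_a: dict, list_b: dict) -> float:
    """Mean same-concept LDN divided by mean different-concept LDN.

    list_a and list_b map concept labels to words (hypothetical format).
    """
    shared = sorted(set(list_a) & set(list_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared concepts")
    same = [ldn(list_a[c], list_b[c]) for c in shared]
    diff = [ldn(list_a[c1], list_b[c2])
            for c1 in shared for c2 in shared if c1 != c2]
    mean_diff = sum(diff) / len(diff)
    if mean_diff == 0:
        raise ValueError("different-concept baseline is zero")
    return (sum(same) / len(same)) / mean_diff


def asjp_sim(list_a: dict, list_b: dict) -> float:
    """Similarity as 1 - LDND (the fractional counterpart of 100% - LDND)."""
    return 1.0 - ldnd(list_a, list_b)


# Toy, invented word lists (not real language data).
lang_a = {"one": "abc", "two": "def"}
lang_b = {"one": "abd", "two": "def"}
print(ldnd(lang_a, lang_b), asjp_sim(lang_a, lang_b))
```

Dividing by the different-concept average acts as a baseline for chance resemblance: two lists whose words happen to look alike across unrelated meanings receive less credit for same-concept similarity.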

[1]  Simon J. Greenhill,et al.  The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics , 2008, Evolutionary bioinformatics online.

[2]  Sarah C. Gudschinsky The ABC'S of Lexicostatistics (Glottochronology) , 1956 .

[3]  Daniel Nettle,et al.  Coevolution of Phonology and the Lexicon in Twelve Languages of West Africa , 1998, J. Quant. Linguistics.

[4]  Michael Cysouw,et al.  A critique of the separation base method for genealogical subgrouping, with data from mixe-zoquean , 2006, J. Quant. Linguistics.

[5]  A Glottochronological Study on Three Okinawan Dialects , 1961, International Journal of American Linguistics.

[6]  Eric W. Holman,et al.  Population Size and Rates of Language Change , 2009, Human biology.

[7]  M. Dryer The Greenbergian word order correlations , 1992 .

[8]  M. Swadesh Lexico-Statistical Dating of Prehistoric Ethnic Contacts , 1952 .

[9]  A. Rumsey,et al.  Worrorran Revisited: The Case for Genetic Relations Among Languages of the Northern Kimberley Region of Western Australia , 2009 .

[10]  Deryle Lonsdale,et al.  Positing Language Relationships Using ALINE , 2011 .

[11]  Simon J. Greenhill Levenshtein Distances Fail to Identify Language Relationships Accurately , 2011, CL.

[12]  Isidore Dyen,et al.  THE LEXICOSTATISTICAL CLASSIFICATION OF THE AUSTRONESIAN LANGUAGES. , 1963 .

[13]  Robert B. Lees,et al.  The Basis of Glottochronology , 1953 .

[14]  Wick R. Miller,et al.  The Classification of the Uto-Aztecan Languages Based on Lexical Evidence , 1984, International Journal of American Linguistics.

[15]  Christian Schulze,et al.  Do Language Change Rates Depend on Population Size? , 2007, Adv. Complex Syst..

[16]  Thomas A. Sebeok,et al.  WEST ATLANTIC: AN INVENTORY OF THE LANGUAGES, THEIR NOUN CLASS SYSTEMS AND CONSONANT ALTERNATION , 1971 .

[17]  G. Breen The Mayi languages of the Queensland Gulf country , 1981 .

[18]  Eric W. Holman,et al.  Evaluating linguistic distance measures , 2010 .

[19]  Taraka Rama,et al.  Phonological diversity, word length, and population sizes across languages: The ASJP evidence , 2011 .

[20]  Simon J. Greenhill,et al.  Languages Evolve in Punctuational Bursts , 2008, Science.

[21]  M. Bromley THE LINGUISTIC RELATIONSHIPS OF GRAND VALLEY DANI: A LEXICO‐STATISTICAL CLASSIFICATION , 1967 .

[22]  Dik Bakker,et al.  Glottochronology as a Heuristic for Genealogical Language Relationships , 2010, J. Quant. Linguistics.

[23]  K. A. McElhanon,et al.  Preliminary Observations on Huon Peninsula Languages , 1967 .

[24]  G. O'Grady,et al.  Proto-Ngayarda Phonology , 1966 .

[25]  B. Hooley Austronesian Languages of the Morobe District, Papua New Guinea , 1971 .

[26]  Wilbert Jan Heeringa Measuring dialect pronunciation differences using Levenshtein distance , 2004 .

[27]  M. Swadesh Salish Internal Relationships , 1950, International Journal of American Linguistics.

[28]  J. Kruskal,et al.  An Indoeuropean classification : a lexicostatistical experiment , 1992 .

[29]  P. K. Benedict Sino-Tibetan: Another Look , 1976 .

[30]  Fiona M. Jordan,et al.  Macro-evolutionary studies of cultural diversity: a review of empirical studies of cultural transmission and cultural adaptation , 2011, Philosophical Transactions of the Royal Society B: Biological Sciences.

[31]  Grzegorz Kondrak,et al.  A New Algorithm for the Alignment of Phonetic Sequences , 2000, ANLP.

[32]  Russell D. Gray,et al.  Language trees support the express-train sequence of Austronesian expansion , 2000, Nature.

[33]  M. Pagel,et al.  Frequency of word-use predicts rates of lexical evolution throughout Indo-European history , 2007, Nature.

[34]  Pierre Verin,et al.  The glottochronology of Malagasy speech communities , 1975 .

[35]  Daniel Nettle,et al.  Segmental inventory size, word length, and communicative efficiency , 1995 .

[36]  Søren Wichmann,et al.  Explorations in automated language classification , 2008 .

[37]  Ilia Peiros,et al.  Comparative linguistics in Southeast Asia , 1998 .

[38]  Rudolph C. Troike The Glottochronology of Six Turkic Languages , 1969, International Journal of American Linguistics.

[39]  Harald Hammarström,et al.  Automated Dating of the World’s Language Families Based on Lexical Similarity , 2011, Current Anthropology.

[40]  Laurie Bauer,et al.  Phoneme inventory size and population size , 2007 .

[41]  Vittorio Loreto,et al.  On the Accuracy of Language Trees , 2011, PloS one.

[42]  H. Hoijer The Chronology of the Athapaskan Languages , 1956, International Journal of American Linguistics.

[43]  Michael Mann,et al.  Continuity and divergence in the Bantu languages : perspectives from a lexicostatistic study , 1999 .

[44]  M. Swadesh Towards Greater Accuracy in Lexicostatistic Dating , 1955, International Journal of American Linguistics.

[45]  Simon J. Greenhill,et al.  Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement , 2009, Science.

[46]  Karl Jost,et al.  Zipf, George K. The Psycho-Biology of Language. Boston, Houghton Mifflin Company 1935. 336 S. 4° , 1937 .

[47]  Mario Cortina-Borja,et al.  Some Remarks on Uto-Aztecan Classification , 1989, International Journal of American Linguistics.

[48]  R. Walker,et al.  Bayesian phylogeography of the Arawak expansion in lowland South America , 2011, Proceedings of the Royal Society B: Biological Sciences.

[49]  Jan P. Sterk,et al.  South central Niger-Congo: a reclassification , 2010 .

[50]  Cecil H. Brown,et al.  Automated classification of the world′s languages: a description of the method and preliminary results , 2008 .

[51]  B. Heine Die Verbreitung und Gliederung der Togorestsprachen , 1968 .

[52]  M. Serva,et al.  Indo-European languages tree by Levenshtein distance , 2007, 0708.2971.

[53]  Viveka Velupillai,et al.  Homelands of the world’s language families: a quantitative approach , 2010 .