Computational historical linguistics

Abstract

Computational approaches to historical linguistics have been proposed for half a century. Within the last decade, this line of research has received a major boost, owing both to the transfer of ideas and software from computational biology and to the release of several large electronic data resources suitable for systematic comparative work. This article introduces and discusses some of the central research topics of this new wave of computational historical linguistics: automatic assessment of genetic relatedness, automatic cognate detection, phylogenetic inference, and ancestral state reconstruction. These are demonstrated in a case study that automatically reconstructs a Proto-Romance word list from lexical data of 50 modern Romance languages and dialects. The results illustrate both the strengths and the weaknesses of the current state of the art in automating the comparative method.
