A Pipeline for Computational Historical Linguistics

There are many parallels between historical linguistics and molecular phylogenetics. In this paper we describe an algorithmic pipeline that mimics, as closely as possible, the traditional workflow of language reconstruction known as the comparative method. The pipeline consists of algorithms drawn from recent research in bioinformatics and adapted to the specifics of linguistic data. This approach can reduce much of the laborious work needed to establish proof of historical relationships between languages. Equally important to our proposal is that each step in the workflow of the comparative method is implemented independently, so that language specialists can scrutinize intermediate results. We have used our pipeline to investigate two groups of languages, the Tsezic languages of the Caucasus and the Mataco-Guaicuruan languages of South America, based on lexical data from the Intercontinental Dictionary Series (IDS). The results of these tests show that the current approach is a viable and useful extension to historical linguistic research.
