Ab Initio: Automatic Latin Proto-word Reconstruction

Proto-word reconstruction is central to the study of language evolution. It consists of recreating the words of an ancient language from the word forms of its modern daughter languages. In this paper, we investigate the automatic reconstruction of Latin proto-words. Given modern word forms in multiple Romance languages (French, Italian, Spanish, Portuguese and Romanian), we infer the form of their common Latin ancestors. Our approach relies on the regularities in the changes that Latin words underwent as they entered the modern languages. We leverage information from all of the modern languages by building an ensemble system for proto-word reconstruction. We use conditional random fields for sequence labeling, and we also conduct preliminary experiments with recurrent neural networks. We apply our method to multiple datasets and show that it improves on previous results while requiring less input data, an essential advantage in historical linguistics, where resources are generally scarce.
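
As a rough illustration of the ensemble idea, the sketch below combines one candidate Latin form per modern language by simple majority vote. The abstract does not specify how the ensemble is actually built, so the voting rule, the function name `vote`, and the example candidates are all assumptions made for illustration, not the authors' method or data.

```python
from collections import Counter


def vote(candidates):
    """Pick the candidate form proposed by the most languages.

    `candidates` maps a language name to the Latin form reconstructed from
    that language (hypothetical per-language model outputs). It must be
    non-empty. Ties go to the form that appears first in the mapping.
    """
    counts = Counter(candidates.values())
    best = max(counts.values())
    for form in candidates.values():
        if counts[form] == best:
            return form


# Hypothetical per-language outputs for one cognate set (illustrative only).
candidates = {
    "French": "noctem",
    "Italian": "noctem",
    "Spanish": "nocte",
    "Portuguese": "noctem",
    "Romanian": "noapte",
}
print(vote(candidates))  # prints "noctem"
```

Whole-word voting is only one possible combination rule; a system could instead weight each language by its model's confidence or vote position by position over aligned characters.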
