Automatic Identification and Production of Related Words for Historical Linguistics

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. Firstly, we introduce a method to automatically determine whether two words are cognates. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymological information. Having built a dataset of related words, we further develop machine learning methods based on orthographic alignment for identifying cognates. We use aligned subsequences as features for classification algorithms in order to infer rules for the linguistic changes that words undergo when entering new languages and to discriminate between cognates and non-cognates. Secondly, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power, and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind. Thirdly, we develop a machine learning method for automatically producing related words. We focus on reconstructing proto-words, but we also address two related sub-problems: producing modern word forms and producing cognates. The task of reconstructing proto-words consists of recreating the words of an ancient language from their reflexes in its modern daughter languages. Given modern word forms in multiple Romance languages, we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when words entered the modern languages. We leverage information from several modern languages, building an ensemble system for reconstructing proto-words. We apply our method to multiple datasets, showing that our approach improves on previous results and has the additional advantage of requiring less input data, which is essential in historical linguistics, where resources are generally scarce.
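To make the idea of using aligned subsequences as features more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard Needleman-Wunsch global alignment with illustrative scores (match 1, mismatch -1, gap -1) and extracts aligned character n-grams that differ between the two words. The abstract does not specify the alignment algorithm, the scoring scheme, or the exact feature format, so the function names `align` and `mismatch_features`, the `x>y` feature encoding, and all parameter values are assumptions.

```python
def align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Globally align two strings (Needleman-Wunsch); '-' marks a gap."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner, preferring substitutions on ties.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def mismatch_features(aligned_a: str, aligned_b: str, n: int = 2):
    """Return aligned n-gram pairs that differ, encoded as 'x>y' (hypothetical format)."""
    feats = []
    for k in range(len(aligned_a) - n + 1):
        x, y = aligned_a[k:k + n], aligned_b[k:k + n]
        if x != y:
            feats.append(f"{x}>{y}")
    return feats


# Illustrative pair: a simplified Latin form and its Italian reflex.
aligned = align("nocte", "notte")
features = mismatch_features(*aligned)
```

Features of this kind could then be turned into a bag-of-features vector per word pair and passed to any standard classifier. Under this sketch, pairs that share recurrent substitution patterns would produce overlapping features, which is the intuition behind learning orthographic change rules from alignments.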
