A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity

This paper applies a corpus-based methodology that uses perplexity to automatically calculate the cross-lingual language distance between historical periods of three ...
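As a rough illustration of how perplexity can serve as a distance measure, the sketch below trains a character n-gram model on text from one language or period and evaluates it on text from another: the higher the perplexity, the more distant the two samples. The function names (`train_char_ngrams`, `perplexity`), the trigram order, the add-one smoothing, and the toy sentences are all assumptions made for this example; the abstract does not specify the model configuration the authors used.

```python
import math
from collections import Counter


def train_char_ngrams(text, n=3):
    """Count character n-grams and their (n-1)-character contexts.

    Hypothetical helper: the text is left-padded with spaces so that
    every character has a full-length context.
    """
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    vocab = set(text)
    return ngrams, contexts, vocab


def perplexity(model, text, n=3):
    """Perplexity of `text` under an add-one smoothed character n-gram model.

    Add-one smoothing is an assumption chosen for brevity, not
    necessarily the smoothing used in the paper.
    """
    if not text:
        raise ValueError("text must be non-empty")
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters unseen in training
    padded = " " * (n - 1) + text
    log_prob = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        prob = (ngrams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_prob += math.log(prob)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(text))


# Toy, made-up samples standing in for two corpora.
train_text = "o rei mandou fazer esta carta " * 20
near_text = "o rei mandou fazer esta carta "
far_text = "the king ordered this letter to be written "

model = train_char_ngrams(train_text)
pp_near = perplexity(model, near_text)
pp_far = perplexity(model, far_text)
print(f"near: {pp_near:.2f}  far: {pp_far:.2f}")
```

In this setup the model trained on one sample plays the role of one language or period and the evaluated text plays the role of the other. Perplexity is not symmetric, so a distance between two corpora would need to combine both directions in some way; how to do that is left open here.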
