Data Driven Models for Language Evolution

Natural languages that originate from a common ancestor are genetically related. Words are the core of any language, and cognates are words that share the same ancestor and etymology. Cognate identification therefore provides the foundation on which the evolutionary history of languages can be reconstructed, while linguistic phylogenetic inference aims to estimate the genetic relationships among those languages. In this thesis, using several techniques originally developed for biological sequence analysis, we have designed a data-driven orthographic learning system for measuring string similarity and have applied it successfully to the tasks of cognate identification and phylogenetic inference. Our system outperformed the best comparable phonetic and orthographic cognate identification models previously reported in the literature, with results that are statistically significant and remarkably stable regardless of variation in the size of the training dataset. When applied to phylogenetic inference for the Indo-European language family, whose higher-level structure is not yet agreed upon, our method estimated phylogenies that are compatible with the benchmark tree and correctly reproduced all the established major language groups and subgroups present in the dataset.
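
To make the idea of measuring string similarity with biological sequence analysis techniques concrete, the sketch below scores two word forms by global alignment (the Needleman-Wunsch dynamic programme) under a character substitution score. It is an illustration only and not the system described in this thesis: the function names, the gap penalty, the normalisation, and the toy scoring function (+1 for identical letters, -1 otherwise) are all assumptions. A learned substitution matrix could stand in for `toy_score`.

```python
def alignment_score(a, b, sub_score, gap=-1.0):
    """Global (Needleman-Wunsch) alignment score of strings a and b.

    sub_score(x, y) gives the score for aligning characters x and y;
    gap is the (assumed, linear) penalty for each inserted or deleted character.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + sub_score(a[i - 1], b[j - 1]),  # substitute or match
                prev[j] + gap,                                 # gap in b
                curr[j - 1] + gap,                             # gap in a
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    """Hypothetical placeholder for a learned substitution matrix."""
    return 1.0 if x == y else -1.0


def similarity(a, b, sub_score=toy_score, gap=-1.0):
    """Alignment score normalised by the mean of the two self-alignment scores.

    This normalisation is one assumed choice among several possible ones.
    """
    denom = (alignment_score(a, a, sub_score, gap)
             + alignment_score(b, b, sub_score, gap)) / 2
    return alignment_score(a, b, sub_score, gap) / denom if denom else 0.0


if __name__ == "__main__":
    for pair in [("night", "nacht"), ("night", "dog")]:
        print(pair, similarity(*pair))
```

Under this toy scoring, the English and German forms "night" and "nacht" score higher than the unrelated pair "night" and "dog". Replacing the identity-based score with data-derived letter correspondences is what would let such a measure reward systematic spelling changes rather than only exact matches.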
