Data Driven Models for Language Evolution

Natural languages that originate from a common ancestor are genetically related. Words are the core of any language, and cognates are words that share the same ancestor and etymology. Cognate identification therefore represents the foundation upon which the evolutionary history of languages may be discovered, while linguistic phylogenetic inference aims to estimate the genetic relationships that exist between them. In this thesis, using several techniques originally developed for biological sequence analysis, we have designed a data-driven orthographic learning system for measuring string similarity, and we have successfully applied it to the tasks of cognate identification and phylogenetic inference. Our system has outperformed the best comparable phonetic and orthographic cognate identification models previously reported in the literature, with results that are statistically significant and remarkably stable regardless of variation in the size of the training dataset. When applied to phylogenetic inference of the Indo-European language family, whose higher-level structure has not yet reached consensus, our method has estimated phylogenies that are compatible with the benchmark tree and has correctly reproduced all the established major language groups and subgroups present in the dataset.
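
As a rough illustration of how a technique from biological sequence analysis can measure orthographic similarity between words, the sketch below scores two strings with a plain Needleman-Wunsch global alignment. It is not the system described in the thesis: the fixed `match`, `mismatch` and `gap` values, and the function names `alignment_score` and `similarity`, are assumptions chosen for illustration, whereas the abstract describes scores learned from data.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two strings.

    The scoring values are illustrative assumptions, not learned parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[-1][-1]


def similarity(a, b):
    """Alignment score divided by the length of the longer word.

    With the default match score of 1, identical words score 1.0.
    """
    if not a and not b:
        return 1.0
    return alignment_score(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # English "night" and German "Nacht" are a commonly cited cognate pair.
    print(similarity("night", "nacht"))
    print(similarity("night", "night"))
```

A word pair would then be proposed as cognate when its similarity exceeds some threshold. Replacing the fixed match and mismatch values with a character-pair substitution table estimated from known cognates would be one way to make such a measure data driven.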

Antonella Delmestri
