Paper information - Ab Initio: Automatic Latin Proto-word Reconstruction

Proto-word reconstruction is central to the study of language evolution. It consists of recreating the words of an ancient language from the word forms of its modern daughter languages. In this paper we investigate automatic word form reconstruction for Latin proto-words. Given modern word forms in multiple Romance languages (French, Italian, Spanish, Portuguese and Romanian), we infer the form of their common Latin ancestors. Our approach relies on the regularities in the changes that occurred when the Latin words entered the modern languages. We leverage information from all of the modern languages, building an ensemble system for proto-word reconstruction. We use conditional random fields for sequence labeling, and we also conduct preliminary experiments with recurrent neural networks. We apply our method to multiple datasets and show that it improves on previous results while requiring less input data, an essential advantage in historical linguistics, where resources are generally scarce.
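The ensemble idea can be illustrated with a toy sketch. The abstract does not say how the per-language outputs are combined, so the score-summing vote below is an assumption rather than the authors' method. The function name `combine_reconstructions`, the candidate forms and their scores are all hypothetical, and the per-language models (conditional random fields in the paper) are replaced here by hard-coded outputs.

```python
from collections import defaultdict


def combine_reconstructions(candidates_by_language):
    """Pick the proto-word with the highest total score across languages.

    candidates_by_language maps a language name to a list of
    (candidate_proto_word, score) pairs. Ties are broken in favour of
    the candidate that was seen first. This voting scheme is an
    illustrative assumption, not the combination rule from the paper.
    """
    totals = defaultdict(float)
    for candidates in candidates_by_language.values():
        for word, score in candidates:
            totals[word] += score
    if not totals:
        raise ValueError("no candidates to combine")
    # max() returns the first maximal key in insertion order.
    return max(totals, key=totals.get)


# Made-up candidate lists standing in for the outputs of three
# per-language models reconstructing the ancestor of
# Italian "notte", Spanish "noche" and French "nuit".
candidates = {
    "Italian": [("noctem", 0.6), ("nottem", 0.4)],
    "Spanish": [("noctem", 0.5), ("nochem", 0.5)],
    "French": [("nuctem", 0.7), ("noctem", 0.3)],
}

print(combine_reconstructions(candidates))  # noctem
```

In this toy run no single language is certain of the answer, yet the form supported by all three accumulates the highest total, which is the intuition behind pooling evidence from several daughter languages.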

Liviu P. Dinu | Alina Maria Ciobanu
