Improved Reconstruction of Protolanguage Word Forms

We present an unsupervised approach to reconstructing ancient word forms that addresses three limitations of previous work. First, previous work focused on faithfulness features, which model changes between successive languages; we add markedness features, which model well-formedness within each language. Second, we introduce universal features, which support generalizations across languages. Third, by using improved inference methods, we increase by an order of magnitude the number of languages to which these methods can be applied. Experiments on the reconstruction of Proto-Oceanic, Proto-Malayo-Javanic, and Classical Latin show substantial reductions in error rate, giving the best results to date.
