Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification

We investigate the problem of measuring phonetic similarity, focusing on the identification of cognates, that is, words of the same origin in different languages. We compare representatives of two principal approaches to computing phonetic similarity: manually designed metrics and learning algorithms. In particular, we consider a stochastic transducer, a Pair hidden Markov model (Pair HMM), several dynamic Bayesian network (DBN) models, and two constructed schemes. We test these approaches on the task of identifying cognates among Indo-European languages, in both supervised and unsupervised settings. Our results suggest that the averaged context DBN model and the Pair HMM achieve the highest accuracy given a large training set of positive examples.
