Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models

We present a system for computing similarity between pairs of words. Our system is based on Pair Hidden Markov Models, a variation on Hidden Markov Models that has been used successfully for the alignment of biological sequences. The parameters of the model are learned automatically from training data consisting of word pairs known to be similar. Our tests focus on the identification of cognates, that is, words of common origin in related languages. The results show that our system outperforms previously proposed techniques.
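To make the idea of scoring a word pair with a Pair Hidden Markov Model concrete, here is a minimal sketch of the forward algorithm for the textbook three-state pair HMM used in biological sequence alignment (a match state M that emits a pair of symbols, and two gap states X and Y that each emit a symbol from one word only). It is not the system described above: the topology is the generic one, and every name and number in it (`DELTA`, `EPSILON`, `TAU`, `pair_emission`, `gap_emission`, `forward_probability`) is an illustrative assumption. In particular, the parameters are hand-picked here, whereas the system learns them from training pairs.

```python
import string

ALPHABET = string.ascii_lowercase
A = len(ALPHABET)

# Hypothetical, hand-picked parameters; a trained system would learn these.
DELTA = 0.1    # M -> X or M -> Y (open a gap)
EPSILON = 0.3  # X -> X or Y -> Y (extend a gap)
TAU = 0.05     # any state -> End


def pair_emission(a, b):
    """Toy joint probability of emitting the aligned pair (a, b) in state M.

    Identical letters share 0.9 of the mass, all other pairs share 0.1,
    so the probabilities over all ordered pairs sum to 1.
    """
    if a == b:
        return 0.9 / A
    return 0.1 / (A * (A - 1))


def gap_emission(a):
    """Toy uniform probability of emitting a single letter in state X or Y."""
    return 1.0 / A


def forward_probability(x, y):
    """Sum the probability of all alignments of x and y (forward algorithm)."""
    n, m = len(x), len(y)
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # the start state behaves like M

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                # Emit the pair (x[i-1], y[j-1]) after arriving from M, X, or Y.
                f_m[i][j] = pair_emission(x[i - 1], y[j - 1]) * (
                    (1 - 2 * DELTA - TAU) * f_m[i - 1][j - 1]
                    + (1 - EPSILON - TAU) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                # Emit x[i-1] against a gap in y.
                f_x[i][j] = gap_emission(x[i - 1]) * (
                    DELTA * f_m[i - 1][j] + EPSILON * f_x[i - 1][j]
                )
            if j > 0:
                # Emit y[j-1] against a gap in x.
                f_y[i][j] = gap_emission(y[j - 1]) * (
                    DELTA * f_m[i][j - 1] + EPSILON * f_y[i][j - 1]
                )

    # Every complete alignment ends with a transition into the End state.
    return TAU * (f_m[n][m] + f_x[n][m] + f_y[n][m])


if __name__ == "__main__":
    for pair in [("night", "nacht"), ("night", "ocean")]:
        print(pair, forward_probability(*pair))
```

Under these toy parameters, a pair that shares letters in corresponding positions (such as "night" and "nacht") receives a higher forward probability than an unrelated pair of the same lengths. Raw forward probabilities shrink as the words get longer, so a practical similarity score would usually need some form of length normalization or comparison against a random model; that step is left out of this sketch.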
