Multilingual Cognate Identification using Integer Linear Programming

The identification of cognates in natural languages is a crucial part of automatic translation lexicon construction and other multilingual lexical tasks. We present new methods for multilingual cognate identification using the global inference framework of Integer Linear Programming. While previous approaches to cognate identification have focused on pairs of natural languages, we provide a methodology that directly forms sets of cognates across groups of languages. We show improvements over simple clustering techniques that do not inherently consider the transitivity of cognate relations. Furthermore, we show that formulations that jointly link cognates across groups of natural languages achieve higher performance than traditional pairwise approaches. We also describe applications of our technique to other important problems in multilingual natural language processing.
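As an illustration of why transitivity matters when cognate sets are formed jointly, the following is a minimal sketch of a 0/1 program over pairwise link variables, not the formulation used in the paper. The function name `solve_cognate_ilp`, the example words, and the integer scores are all hypothetical. A real system would hand the same objective and constraints to an ILP solver; exhaustive search is used here only to keep the example self-contained.

```python
from itertools import combinations, product


def solve_cognate_ilp(n_words, score):
    """Exhaustively solve a tiny 0/1 program with transitivity constraints.

    score[(i, j)] (with i < j) is a hypothetical pairwise score: positive
    values favor linking words i and j as cognates, negative values oppose it.
    """
    pairs = list(combinations(range(n_words), 2))
    best_value, best_links = None, None

    # Enumerate every 0/1 assignment to the pairwise link variables x_ij.
    for bits in product((0, 1), repeat=len(pairs)):
        x = dict(zip(pairs, bits))

        # Transitivity: if two sides of a triangle are linked, the third
        # must be linked as well (x_ab + x_bc - x_ac <= 1, for each rotation).
        feasible = all(
            x[(i, j)] + x[(j, k)] - x[(i, k)] <= 1
            and x[(i, j)] + x[(i, k)] - x[(j, k)] <= 1
            and x[(i, k)] + x[(j, k)] - x[(i, j)] <= 1
            for i, j, k in combinations(range(n_words), 3)
        )
        if not feasible:
            continue

        value = sum(score[p] * x[p] for p in pairs)
        if best_value is None or value > best_value:
            best_value = value
            best_links = [p for p in pairs if x[p] == 1]

    return best_links, best_value


# Hypothetical word list: one word from each of four languages.
words = ["night (en)", "Nacht (de)", "nacht (nl)", "mesa (es)"]

# Hypothetical integer scores; the en-nl pair is scored slightly negative.
score = {
    (0, 1): 6, (0, 2): -1, (0, 3): -8,
    (1, 2): 9, (1, 3): -8, (2, 3): -8,
}

links, value = solve_cognate_ilp(len(words), score)
for i, j in links:
    print(words[i], "<->", words[j])
print("objective:", value)
```

In this toy instance, pairwise thresholding would link en-de and de-nl but reject en-nl, which is not a consistent cognate set. Under the transitivity constraints, the search instead links all three words and leaves the fourth unlinked, since that assignment has the highest feasible objective.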
