Mining Transliterations from Wikipedia using Dynamic Bayesian Networks

Transliteration mining aims to build high-quality multilingual named entity (NE) lexicons that improve performance in Natural Language Processing (NLP) tasks, including Machine Translation (MT) and Cross-Language Information Retrieval (CLIR). In this paper, we apply two Dynamic Bayesian Network (DBN)-based edit distance (ED) approaches to mining transliteration pairs from Wikipedia. Transliteration identification results on standard corpora for seven language pairs suggest that the DBN-based edit distance approaches are suitable for modeling transliteration similarity. An evaluation on mining transliteration pairs from English-Hindi and English-Tamil Wikipedia topic pairs shows that they improve transliteration mining quality over state-of-the-art approaches.
