A Dirichlet Process Mixture Based Name Origin Clustering and Alignment Model for Transliteration

In machine transliteration, the names to be transliterated often come from multiple language origins. A conventional single model trained by maximum likelihood cannot handle this well and often suffers from overfitting. In this paper, we use a coupled Dirichlet process mixture model (cDPMM) to address overfitting and multi-origin name clustering simultaneously in the sequence alignment step over name pairs. After alignment, the cDPMM automatically clusters the name pairs into groups according to their origin. In the decoding step, to make full use of the learned origin information, we use a cluster combination method (CCM) that builds cluster-specific transliteration models by merging small clusters into larger ones based on the perplexities of the name language model and the transliteration model, ensuring that each origin cluster has enough data to train a transliteration model. On three Western-Chinese multi-origin name corpora, the cDPMM outperforms two state-of-the-art baseline models in both top-1 accuracy and mean F-score, and the CCM yields a further significant improvement over the cDPMM.
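
As background for the clustering step, a Dirichlet process mixture is commonly described through the Chinese restaurant process: a new item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The sketch below shows only this generic prior. It is not the cDPMM sampler itself, which would also have to weigh the alignment likelihood of each name pair; the function name `crp_assignment_probs` and the value of `alpha` are illustrative assumptions.

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Chinese restaurant process prior over cluster assignments.

    cluster_sizes: number of items already in each existing cluster.
    alpha: concentration parameter (assumed value, not from the paper).
    Returns one probability per existing cluster, followed by the
    probability of opening a new cluster as the last element.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if any(n <= 0 for n in cluster_sizes):
        raise ValueError("existing clusters must be non-empty")

    total = sum(cluster_sizes) + alpha
    # Existing clusters: probability proportional to their current size.
    probs = [n / total for n in cluster_sizes]
    # New cluster: probability proportional to alpha.
    probs.append(alpha / total)
    return probs


# Three hypothetical origin clusters holding 6, 3, and 1 name pairs.
probs = crp_assignment_probs([6, 3, 1], alpha=2.0)
```

Because larger clusters attract more items while `alpha` always leaves some mass for a new cluster, the number of clusters does not need to be fixed in advance, which is what allows origin groups to emerge from the data.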
