Paper information - Learning better transliterations

Learning better transliterations

We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2^(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m^2 * k^2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
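To make the segmentation count and the role of dynamic programming concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names (`segmentations`, `pair_probability`) and the toy production table are illustrative assumptions, and the sketch further assumes that every production maps a non-empty source substring to a non-empty target substring and that all segmentations are weighted equally. The first function enumerates the 2^(|w|-1) segmentations of a word directly. The second scores a source/target pair by summing over all joint segmentations with a forward-style recurrence, so no enumeration is needed.

```python
from functools import lru_cache
from itertools import combinations


def segmentations(word):
    """Yield every split of `word` into contiguous, non-overlapping substrings.

    Each of the len(word) - 1 gaps between characters is either cut or not,
    giving 2 ** (len(word) - 1) segmentations for a non-empty word.
    """
    n = len(word)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


def pair_probability(source, target, prob):
    """Sum over all joint segmentations of the product of production probabilities.

    `prob` is a hypothetical table mapping (source_substring, target_substring)
    to a probability; pairs missing from the table count as 0.
    """

    @lru_cache(maxsize=None)
    def forward(i, j):
        # Total probability of producing target[:j] from source[:i].
        if i == 0 and j == 0:
            return 1.0
        total = 0.0
        for a in range(i):      # the last source substring is source[a:i]
            for b in range(j):  # the last target substring is target[b:j]
                p = prob.get((source[a:i], target[b:j]), 0.0)
                if p:
                    total += forward(a, b) * p
        return total

    return forward(len(source), len(target))


if __name__ == "__main__":
    print(list(segmentations("abc")))
    toy = {("a", "x"): 0.5, ("b", "y"): 0.5, ("ab", "xy"): 0.25}  # made-up values
    print(pair_probability("ab", "xy", toy))
```

The recurrence visits |source| * |target| states and does up to |source| * |target| work in each, so scoring one pair costs on the order of m^4 operations. That appears consistent with the O(m^4 * w) discovery cost quoted in the abstract, although the paper's actual algorithm may differ in its details.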

Dan Roth | Jeff Pasternack
