We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic (it requires no knowledge of the source or target language), and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate an improvement in accuracy of more than 10% over the existing state of the art in Chinese, Hebrew and Russian. While past work has commonly used fixed-size n-gram features along with more traditional models such as HMMs or the perceptron, we use an intuitive notion of "productions": each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target words of each training pair as our latent data. Despite the size of the parameter space and the 2^(|w|-1) possible segmentations to consider for each word w, dynamic programming lets each iteration of EM run in O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and it is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w here denotes the number of candidate words to choose from, and generating a transliteration takes O(m^2 * k^2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
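To make the notion of productions concrete, the following Python sketch enumerates the 2^(|w|-1) segmentations of a word and scores a candidate transliteration by summing over all substring alignments with dynamic programming. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy probability table `prob`, and the restriction that every production maps a non-empty source substring to a non-empty target substring are all hypothetical choices made here.

```python
from itertools import combinations


def segmentations(word):
    """Yield every split of `word` into contiguous, non-overlapping substrings."""
    n = len(word)
    # Choose any subset of the n-1 internal positions as cut points.
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


def pair_probability(source, target, prob):
    """Total probability of `target` given `source`, summed over all alignments.

    `prob` maps (source_substring, target_substring) to a production
    probability; it is a hypothetical toy table, not learned parameters.
    Assumption: both sides of a production are non-empty.
    """
    m, n = len(source), len(target)
    # chart[i][j] = probability mass of aligning source[:i] with target[:j]
    chart = [[0.0] * (n + 1) for _ in range(m + 1)]
    chart[0][0] = 1.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            total = 0.0
            # The last production covers source[a:i] and target[b:j].
            for a in range(i):
                for b in range(j):
                    p = prob.get((source[a:i], target[b:j]), 0.0)
                    if p:
                        total += chart[a][b] * p
            chart[i][j] = total
    return chart[m][n]


def discover(source, candidates, prob):
    """Pick the candidate with the highest summed alignment probability."""
    return max(candidates, key=lambda c: pair_probability(source, c, prob))


if __name__ == "__main__":
    toy = {("a", "x"): 0.5, ("b", "y"): 0.5, ("ab", "xy"): 0.2}
    print(len(list(segmentations("abcd"))))   # 2^(4-1) segmentations
    print(discover("ab", ["yx", "xy"], toy))
```

The four nested loops in `pair_probability` perform a number of table lookups that grows with the fourth power of the word length, which is one plausible reading of the per-candidate cost of discovery quoted above; the sketch ignores the cost of slicing substrings and omits EM training and pruned generation.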