A language-independent probabilistic model for automatic conversion between graphemic and phonemic transcription of words

In this paper we present a novel language-independent probabilistic model for automatic grapheme-to-phoneme and phoneme-to-grapheme conversion of words. Training is fully unsupervised and applies two processes: transformation rules that usually fail to provide the correct symbols are eliminated, and new variable-length string transformation rules are defined to improve the string transformation accuracy on the training data. In an iterative process, the probabilistic transformation rules are updated in the direction that reduces the error rate of the transformed symbols. Long-term dependencies are defined automatically. Training and testing of the model were carried out on lexicon and natural language corpora of six European languages. Accurate generalisations were achieved in all experiments for both transformation directions using a relatively small number of rules defined in the training procedure. It is demonstrated that the variable-length probabilistic rules are sufficiently effective for describing bi-directional transcription.
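
The rule mechanism summarised above can be illustrated with a minimal Python sketch. It is not the model presented in this paper. For simplicity it assumes that each training word is already segmented into aligned (source chunk, target chunk) pairs, whereas the training described above is unsupervised. The function names (`estimate_rules`, `prune`, `transcribe`), the toy lexicon, the ad hoc phoneme symbols, the `min_prob` threshold, and the greedy longest-match strategy are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Toy lexicon (assumption): each word is a list of (grapheme chunk, phoneme chunk)
# pairs. The phoneme symbols are ad hoc and not taken from any real inventory.
ALIGNED = [
    [("sh", "S"), ("i", "I"), ("p", "p")],
    [("s", "s"), ("i", "I"), ("p", "p")],
    [("sh", "S"), ("o", "Q"), ("p", "p")],
    [("h", "h"), ("i", "I"), ("p", "p")],
    [("c", "k"), ("a", "a"), ("t", "t")],
    [("c", "s"), ("i", "I"), ("t", "t")],
]


def estimate_rules(aligned_words):
    """Turn chunk co-occurrence counts into rule probabilities P(target | source)."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in aligned_words:
        for src, tgt in word:
            counts[src][tgt] += 1
    rules = {}
    for src, targets in counts.items():
        total = sum(targets.values())
        rules[src] = {tgt: n / total for tgt, n in targets.items()}
    return rules


def prune(rules, min_prob):
    """Eliminate rules whose most probable output falls below min_prob."""
    return {src: t for src, t in rules.items() if max(t.values()) >= min_prob}


def transcribe(word, rules):
    """Apply variable-length rules left to right, preferring the longest match."""
    max_len = max(map(len, rules), default=1)
    out = []
    i = 0
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in rules:
                targets = rules[chunk]
                out.append(max(targets, key=targets.get))
                i += n
                break
        else:
            # No rule covers this symbol: copy it through unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


# Grapheme-to-phoneme rules, and the reverse direction from the same data.
g2p_rules = estimate_rules(ALIGNED)
p2g_rules = estimate_rules([[(tgt, src) for src, tgt in w] for w in ALIGNED])

# Pruning removes the unreliable rule for "c", which is right only half the time.
reliable_g2p = prune(g2p_rules, min_prob=0.6)
```

With this toy data, `transcribe("ship", g2p_rules)` uses the two-symbol rule for `sh` before falling back to single-symbol rules, and `transcribe("SIp", p2g_rules)` shows the same machinery running in the opposite direction. The iterative rule updating and the automatic definition of long-term dependencies described above are not reproduced here.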