Compression of exception lexicons for small footprint grapheme-to-phoneme conversion

We present a method for reducing the memory footprint of a grapheme-to-phoneme (G2P) conversion module without sacrificing accuracy. Because a G2P module is typically not fully accurate, it is common to augment it with an exception lexicon: a list of words that the G2P module transduces incorrectly and for which correct pronunciations are required, each stored with its correct pronunciation. The exception lexicon is one of the major factors limiting how far the overall size of the G2P module can be reduced, so we compress it. We propose a novel compression method that is closely tied to the G2P conversion method. The underlying idea is that, even for words that are not transduced correctly, the decision trees of the G2P module generate a phonetic transcription that is close to the correct one. It is therefore sufficient to store only the correction in the exception lexicon. The correction is represented in terms of corrections to the transduction process, so the encoding can exploit the knowledge gained from the training data about the probabilities of different corrections and thereby compress more efficiently. In an experiment, this method represented an exception pronunciation with fewer than 4 bits on average, a compression factor of 7 relative to the baseline representation.
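
The idea of storing a correction to the transduction process, rather than the pronunciation itself, can be illustrated with a minimal sketch. It is not the coding scheme of this work, which the abstract does not specify. It assumes that each letter reaches a decision-tree leaf holding a probability distribution over candidate phonemes; the toy distributions, the phoneme symbols, and the function names (`encode_ranks`, `decode_ranks`, `ideal_bits`) are all hypothetical. The correct pronunciation is stored as the rank of the correct phoneme at each leaf, and the ideal entropy-coded length is the sum of -log2 p over the letters, so letters that the trees already predict well cost almost nothing.

```python
import math

def ranked(dist):
    """Candidate phonemes of one leaf, most probable first (ties broken by name)."""
    return sorted(dist, key=lambda ph: (-dist[ph], ph))

def encode_ranks(leaves, correct):
    """Store, per letter, the rank of the correct phoneme in the leaf's ordering."""
    return [ranked(dist).index(ph) for dist, ph in zip(leaves, correct)]

def decode_ranks(leaves, ranks):
    """Recover the pronunciation from the same leaves and the stored ranks."""
    return [ranked(dist)[r] for dist, r in zip(leaves, ranks)]

def ideal_bits(leaves, correct):
    """Ideal code length: sum of -log2 p(correct phoneme) over all letters."""
    return sum(-math.log2(dist[ph]) for dist, ph in zip(leaves, correct))

# Toy leaf distributions for the letters of "pint" (assumed values, not real data).
# The empty string stands for a silent letter.
leaves = [
    {"p": 0.98, "": 0.02},
    {"ih": 0.70, "ay": 0.25, "iy": 0.05},
    {"n": 0.99, "": 0.01},
    {"t": 0.95, "": 0.05},
]
correct = ["p", "ay", "n", "t"]  # the top-ranked output would wrongly give "ih"

ranks = encode_ranks(leaves, correct)
decoded = decode_ranks(leaves, ranks)
bits = ideal_bits(leaves, correct)
print(ranks, decoded, round(bits, 2))
```

In this toy case only the second letter deviates from the top-ranked prediction, so nearly all of the code length comes from that single correction. A real system would feed these probabilities to an entropy coder such as an arithmetic coder, rather than storing the ranks directly.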