Automated lexical adaptation and speaker clustering based on pronunciation habits for non-native speech recognition

This paper describes a method for improving recognition of non-native speech in a spoken dialogue system. Starting from very general rules about possible vocalic substitutions, the frequency of each substitution in different phonetic contexts is estimated from a small set of recordings. The most frequently observed substitutions are applied to the recognizer's lexicon. Speakers in the training set are automatically clustered according to their preferred phonetic variants, and a specific lexicon is built for each cluster. Acoustic adaptation is also performed on each cluster. Experiments show that lexical adaptation yields a relative reduction in word error rate (WER) over acoustic adaptation alone. Lexical clustering can further reduce WER, provided the system can reliably select the cluster that best matches each input utterance.
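
The pipeline summarized above (estimating substitution frequencies per context, adding the frequent variants to the lexicon, and grouping speakers by preferred variant) might be sketched as follows. This is only an illustrative toy under stated assumptions, not the paper's implementation: the observation tuples, phone symbols, the 0.5 threshold, the use of the following phone as the only context, and the rule of clustering each speaker by a single most frequent substitution are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical aligned observations (assumed data, not from the paper):
# (speaker, canonical vowel, realized vowel, following phone as context)
OBSERVATIONS = [
    ("spk1", "ih", "iy", "t"),
    ("spk1", "ih", "iy", "t"),
    ("spk1", "ih", "ih", "t"),
    ("spk2", "ih", "iy", "t"),
    ("spk2", "ih", "iy", "t"),
    ("spk2", "ae", "eh", "t"),
    ("spk3", "ae", "eh", "t"),
    ("spk3", "ae", "eh", "t"),
    ("spk3", "ih", "ih", "t"),
]


def substitution_rates(observations):
    """Relative frequency of each realized vowel given (canonical, context)."""
    totals = Counter()
    counts = Counter()
    for _, canonical, realized, context in observations:
        totals[(canonical, context)] += 1
        counts[(canonical, context, realized)] += 1
    return {key: n / totals[key[:2]] for key, n in counts.items()}


def frequent_substitutions(rates, threshold):
    """Keep true substitutions (realized != canonical) at or above threshold."""
    rules = defaultdict(list)
    for (canonical, context, realized), rate in rates.items():
        if realized != canonical and rate >= threshold:
            rules[(canonical, context)].append(realized)
    return dict(rules)


def adapt_lexicon(lexicon, rules):
    """Add pronunciation variants, applying one substitution per variant."""
    adapted = {}
    for word, phones in lexicon.items():
        variants = [list(phones)]
        for i, phone in enumerate(phones):
            # "#" marks the word boundary when there is no following phone.
            context = phones[i + 1] if i + 1 < len(phones) else "#"
            for substitute in rules.get((phone, context), []):
                variant = list(phones)
                variant[i] = substitute
                variants.append(variant)
        adapted[word] = variants
    return adapted


def cluster_speakers(observations):
    """Group speakers by their single most frequent true substitution."""
    per_speaker = defaultdict(Counter)
    for speaker, canonical, realized, _ in observations:
        if realized != canonical:
            per_speaker[speaker][(canonical, realized)] += 1
    clusters = defaultdict(list)
    for speaker, counts in sorted(per_speaker.items()):
        preferred = counts.most_common(1)[0][0]
        clusters[preferred].append(speaker)
    return dict(clusters)


rates = substitution_rates(OBSERVATIONS)
rules = frequent_substitutions(rates, threshold=0.5)
lexicon = adapt_lexicon({"sit": ["s", "ih", "t"], "cat": ["k", "ae", "t"]}, rules)
clusters = cluster_speakers(OBSERVATIONS)
```

In this toy setup, each cluster's speakers would then supply the observations for a cluster-specific call to `substitution_rates` and `adapt_lexicon`, giving one lexicon per cluster. A realistic system would need richer contexts and a more robust clustering criterion than the single-preference rule assumed here.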