Improvements in Non-Verbal Cue Identification Using Multilingual Phone Strings

Today's state-of-the-art front-ends for multilingual speech-to-speech translation systems apply monolingual speech recognizers trained for a single language and/or accent. The monolingual speech engine is usually adapted to an unknown speaker over time using unsupervised training methods; however, if the speaker was seen during training, their specialized acoustic model is applied instead, since it achieves better performance. To make full use of specialized acoustic models in such a scenario, it is necessary to identify the speaker automatically and with high accuracy. Furthermore, monolingual speech recognizers currently rely on the language and/or accent being selected beforehand by the user. This requires the user's cooperation and an interface that easily allows for such a selection. Both requirements are awkward and error-prone, especially when translation services are provided for many languages on small devices such as PDAs or telephones. For these scenarios, front-ends are desired which automatically identify the spoken language or accent. We believe that the automatic identification of an utterance's non-verbal cues, such as language, accent, and speaker, is necessary for the successful deployment of speech-to-speech translation systems.
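
To make the intended front-end behavior concrete, the following is a minimal, purely illustrative Python sketch (not the system described in this paper) of the dispatch logic the paragraph motivates: prefer a speaker-specific acoustic model when the speaker is identified with sufficient confidence, and otherwise fall back to the monolingual model for the automatically identified language or accent. All names here (identify_speaker, identify_language, AcousticModel, the confidence threshold) are hypothetical placeholders.

```python
# Illustrative sketch only: every function and class below is a hypothetical
# stand-in for the identification and recognition components discussed above.

from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AcousticModel:
    """Placeholder for a trained acoustic model (language-, accent-, or speaker-specific)."""
    name: str


def select_acoustic_model(
    utterance: bytes,
    speaker_models: Dict[str, AcousticModel],   # models for speakers seen during training
    language_models: Dict[str, AcousticModel],  # monolingual models per language/accent
    identify_speaker,                           # hypothetical: utterance -> (speaker_id, confidence)
    identify_language,                          # hypothetical: utterance -> (language_or_accent, confidence)
    speaker_threshold: float = 0.9,             # assumed confidence cut-off, not from the paper
) -> Optional[AcousticModel]:
    """Pick an acoustic model without requiring the user to select language or speaker.

    A specialized speaker-dependent model is used when the speaker is recognized
    with high confidence; otherwise the front-end falls back to the monolingual
    model for the automatically identified language/accent.
    """
    speaker_id, speaker_conf = identify_speaker(utterance)
    if speaker_id in speaker_models and speaker_conf >= speaker_threshold:
        return speaker_models[speaker_id]

    language, _ = identify_language(utterance)
    return language_models.get(language)  # None if the identified language is unsupported
```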