Speaker, accent, and language identification using multilingual phone strings

In this paper we investigate the identification of non-verbal cues from speech, namely speaker, accent, and language. For these tasks, we develop a joint framework that uses phone strings, derived from phone recognizers for different languages, as intermediate features and bases its classification decisions on their perplexities. Our evaluation on variable-distance data demonstrated the robustness of the approach, achieving a 96.7% speaker identification rate. Furthermore, we achieved 93.7% accuracy in discriminating between native and non-native speakers by accent. For language identification, we obtained 95.5% classification accuracy on utterances 5 seconds in length and up to 99.89% on longer utterances. The experiments were carried out in a language-independent manner, on languages not presented to the phone recognizers during training, suggesting that they could be successfully ported to non-verbal cue classification in other languages.
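As an illustration of perplexity-based classification over phone strings, the following is a minimal sketch and not the system evaluated here. It assumes add-one smoothed bigram phone models, one per class and per phone recognizer, and assumes that the decision is taken by summing perplexities across recognizers and choosing the class with the lowest total. The class labels, the recognizer names (`rec1`, `rec2`), the helper names, and the toy phone strings are all hypothetical.

```python
import math
from collections import Counter


class BigramPhoneLM:
    """Add-one smoothed bigram model over phone symbols (a simplifying assumption)."""

    def __init__(self, phone_strings):
        self.vocab = {"</s>"}
        self.bigrams = Counter()
        self.contexts = Counter()
        for phones in phone_strings:
            seq = ["<s>"] + list(phones) + ["</s>"]
            self.vocab.update(phones)
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def perplexity(self, phones):
        """Per-symbol perplexity of one phone string under this model."""
        seq = ["<s>"] + list(phones) + ["</s>"]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen phones
        log_prob = 0.0
        for prev, cur in zip(seq, seq[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(seq) - 1))


def classify(models, decoded):
    """Pick the class whose models give the lowest summed perplexity.

    models:  {class_label: {recognizer_name: BigramPhoneLM}}
    decoded: {recognizer_name: phone string of the test utterance}
    """
    scores = {
        label: sum(lms[rec].perplexity(phones) for rec, phones in decoded.items())
        for label, lms in models.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical training phone strings: two classes, two phone recognizers.
training = {
    "A": {"rec1": [["a", "b", "a", "b"], ["a", "b", "a"]],
          "rec2": [["x", "y", "x", "y"]]},
    "B": {"rec1": [["b", "b", "a", "a"], ["b", "b", "a"]],
          "rec2": [["y", "y", "x", "x"]]},
}
models = {
    label: {rec: BigramPhoneLM(strings) for rec, strings in per_rec.items()}
    for label, per_rec in training.items()
}

# Hypothetical test utterance as decoded by each recognizer.
test_utterance = {"rec1": ["a", "b", "a", "b"], "rec2": ["x", "y", "x"]}
label, scores = classify(models, test_utterance)
print(label, scores)
```

In this sketch the class label could stand for a speaker, an accent group, or a language, which is the sense in which a single framework can serve all three tasks. Summing perplexities is only one possible way to combine the recognizers.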