Combined speech and speaker recognition with speaker-adapted connectionist models

One approach to speaker adaptation for the neural-network acoustic models of a hybrid connectionist-HMM speech recognizer is to adapt a speaker-independent network by performing a small amount of additional training using data from the target speaker, giving an acoustic model specifically tuned to that speaker. This adapted model might be useful for speaker recognition too, especially since state-of-the-art speaker recognition typically performs a speech-recognition labelling of the input speech as a first stage. However, to exploit the discriminant nature of the neural nets, it is better to train a single model to discriminate both between the different phone classes (as in conventional speech recognition) and between the target speaker and the ‘rest of the world’ (a common approach to speaker recognition). We present the results of using such an approach for a set of 12 speakers selected from the DARPA/NIST Broadcast News corpus. The speaker-adapted nets showed a 17% relative improvement in word error rate on their target speakers, and were able to discriminate among the 12 speakers with an average equal-error rate of 6.6%.
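
For readers unfamiliar with the equal-error rate quoted above, the following minimal sketch illustrates how such a figure is commonly computed: a decision threshold is swept over the scores until the false-acceptance rate (impostors accepted) matches the false-rejection rate (target speakers rejected). The function name `equal_error_rate` and the toy scores are assumptions made for illustration. This is not the scoring procedure used in the work described here.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Illustrative helper only; the name and interface are hypothetical.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a true target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores, purely for illustration (not data from the experiments).
target = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(target, impostor))  # 0.25
```

With only a handful of trials the two error rates may never be exactly equal, so the sketch reports the midpoint at the closest operating point. An average over speakers, as quoted above, would presumably apply such a computation per target speaker and then take the mean.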