Compensation of extrinsic variability in speaker verification systems on simulated Skype and HF channel data

In this work we focus on speaker verification over channels of varying quality, namely Skype and high-frequency (HF) radio. In our setup, we assume that telephone recordings of the speakers are available for training, whereas the test recordings come from different channels with varying (lower) signal quality. Starting from a Gaussian mixture model / support vector machine (GMM/SVM) baseline, we evaluate multi-condition training (MCT), an ideal channel classification (ICC) approach, and nuisance attribute projection (NAP) as ways to compensate for the loss of information caused by the transmission. In an evaluation on Switchboard-2 data using Skype and HF channel simulators, we show that, for good signal quality, NAP improves the equal error rate (EER) of the baseline system from 5% to 3.33% on both Skype and HF. For strongly distorted data, MCT or, where applicable, ICC turns out to be the method of choice.
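
The results above are reported as equal error rates. As a reading aid, the following is a minimal sketch of how an EER can be approximated from verification scores. It is not the evaluation code used in this work: the function name `equal_error_rate`, the toy scores, and the simple threshold sweep are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration only; higher scores are assumed
    to mean "more likely the claimed speaker".
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a target trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where both error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one impostor outscores one target.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, impostors))
```

With only a handful of trials the false acceptance and false rejection curves may not cross exactly, so the sketch returns the average of the two rates at the closest point; practical evaluations typically interpolate over many more trials.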
