Extraction of Speaker Features from Different Stages of DSR Front-Ends for Distributed Speaker Verification

ETSI has recently published a front-end processing standard for distributed speech recognition (DSR) systems. The key idea of the standard is to extract the spectral features of speech signals at the front-end terminals so that the acoustic distortion caused by communication channels can be avoided. This paper investigates how extracting spectral features from different stages of the front-end processing affects the performance of distributed speaker verification systems. A technique that combines handset selectors with stochastic feature transformation is also employed in the back-end speaker verification system to reduce the acoustic mismatch between different handsets. Because the feature vectors received by the back-end server are vector quantized, the paper proposes two approaches to adding Gaussian noise to the quantized feature vectors before training the Gaussian mixture speaker models. In the first approach, the variances of the Gaussian noise depend on the codeword distance. In the second approach, the variances are a function of the distance between some unquantized training vectors and their closest code vectors. Experiments on the HTIMIT corpus with 150 speakers show that stochastic feature transformation can be added to the back-end server to compensate for transducer distortion. They also show that verification performance improves when the LMS-based blind equalization in the standard is replaced by stochastic feature transformation.
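The first noise-adding approach can be illustrated with a minimal sketch. The abstract does not give the exact rule that maps codeword distance to noise variance, so everything specific here is an assumption made for illustration: the function names, the use of the nearest-neighbour Euclidean distance between code vectors as the "codeword distance", and the hypothetical `scale` factor that turns that distance into a standard deviation.

```python
import math
import random


def nearest_codeword_distances(codebook):
    """Return, for each code vector, the Euclidean distance to its nearest
    neighbour in the codebook (assumed meaning of "codeword distance")."""
    dists = []
    for i, c in enumerate(codebook):
        nearest = min(
            math.dist(c, other) for j, other in enumerate(codebook) if j != i
        )
        dists.append(nearest)
    return dists


def add_codeword_dependent_noise(indices, codebook, scale=0.25, rng=None):
    """Dequantize VQ indices and add zero-mean Gaussian noise whose standard
    deviation is `scale` times the spacing around the selected codeword.

    `scale` is a hypothetical tuning parameter, not a value from the paper.
    """
    rng = rng or random.Random()
    spacing = nearest_codeword_distances(codebook)
    noisy = []
    for idx in indices:
        sigma = scale * spacing[idx]  # wider cells receive more noise
        noisy.append([x + rng.gauss(0.0, sigma) for x in codebook[idx]])
    return noisy


if __name__ == "__main__":
    # Toy 2-D codebook; real DSR codebooks quantize pairs of cepstral features.
    codebook = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
    received = [0, 2, 2, 1]  # VQ indices as they would arrive at the server
    for vec in add_codeword_dependent_noise(received, codebook, rng=random.Random(0)):
        print(vec)
```

The intent of such a scheme is that quantized training vectors, which otherwise collapse onto a finite set of points, are spread out enough for Gaussian mixture training to estimate non-degenerate variances. The second approach described above would replace `spacing[idx]` with a statistic computed from unquantized training vectors and their closest code vectors.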
