Soft‐decision a priori knowledge interpolation for robust telephone speaker identification

Abstract Handsets that are not seen in the training phase (a.k.a. unseen handsets) are a main source of performance degradation for speaker identification (SID) applications in telecommunication environments. To alleviate this problem, this paper proposes a soft‐decision a priori knowledge interpolation (SD‐AKI) method of handset characteristic estimation for handset mismatch‐compensated SID. The SD‐AKI method first collects a set of characteristics of seen handsets in the training phase and then, in the test phase, estimates the characteristic of the unknown test handset by interpolating that set. The estimated handset characteristic is then used to compensate for handset mismatch, yielding robust SID. The SD‐AKI method can be realized in both the feature and model spaces. Experimental results on the handset TIMIT (HTIMIT) database showed that both the proposed feature‐ and model‐space SD‐AKI schemes were more robust than blind cepstral mean subtraction (CMS), feature warping (FW), and their hard‐decision counterpart (HD‐AKI) in both the all‐handset and unseen‐handset SID tests. SD‐AKI is therefore a promising method for robust SID.
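
The interpolation idea can be illustrated with a minimal feature-space sketch. The abstract does not give the estimation formulas, so everything concrete here is an assumption made for illustration: each handset characteristic is modeled as an additive cepstral bias vector, the interpolation weights come from a softmax over per-handset log-likelihoods, and the function names (`soft_weights`, `interpolate_soft`, `interpolate_hard`, `compensate`) are hypothetical. It is not the authors' implementation.

```python
import math

def soft_weights(log_likelihoods):
    """Normalize per-handset log-likelihoods into weights that sum to 1.
    Assumption: a softmax stands in for whatever weighting the paper uses."""
    m = max(log_likelihoods)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate_soft(seen_biases, weights):
    """Soft decision: weighted average of all seen-handset bias vectors."""
    dim = len(seen_biases[0])
    return [sum(w * b[d] for w, b in zip(weights, seen_biases))
            for d in range(dim)]

def interpolate_hard(seen_biases, weights):
    """Hard decision: pick the single most likely seen handset."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return list(seen_biases[best])

def compensate(features, bias):
    """Subtract the estimated bias from every feature frame
    (assumes an additive channel effect in the cepstral domain)."""
    return [[x - b for x, b in zip(frame, bias)] for frame in features]

# Toy data: two seen handsets with 2-dimensional bias vectors.
seen_biases = [[1.0, 0.0], [0.0, 1.0]]
# The test utterance is equally likely under both handset models.
weights = soft_weights([-5.0, -5.0])
soft_bias = interpolate_soft(seen_biases, weights)
hard_bias = interpolate_hard(seen_biases, weights)
cleaned = compensate([[1.5, 2.5]], soft_bias)
```

In this toy case the test handset sits between the two seen handsets, so the soft estimate blends both biases while the hard decision commits to one of them. That is the intuition behind interpolating for unseen handsets rather than selecting the closest seen one.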
