Speaker-specific mapping for text-independent speaker recognition

In this paper, we present the concept of speaker-specific mapping for the task of speaker recognition. The speaker-specific mapping is realized using a multilayer feedforward neural network. In the mapping approach, the aim is to capture the speaker-specific information by mapping a set of parameter vectors specific to the linguistic information in the speech onto a set of parameter vectors carrying both linguistic and speaker information. In this study, parameter vectors suitable for speaker-specific mapping are explored. Background normalization for score comparison and a network error criterion for frame selection are proposed to improve the performance of the basic system. It is shown that removing the high-frequency components of speech degrades the performance of the speaker verification system. For all 630 speakers of the TIMIT database, the mapping approach achieves an equal error rate (EER) of 0.5% and an identification rate of 100%. On a set of 38 speakers from the dialect region "dr1" of the NTIMIT database, an EER of 6.6% is obtained.
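The EER figures quoted above are the operating point at which the false acceptance rate equals the false rejection rate. As an illustration only, the following sketch shows one common way to estimate an EER from two lists of verification scores. The function name `equal_error_rate`, the convention that higher scores mean a better match, and the toy scores are assumptions made for this example; they are not taken from the paper, whose scoring and normalization details are not given here.

```python
import numpy as np


def equal_error_rate(genuine, impostor):
    """Estimate the EER from genuine and impostor scores.

    Assumption (not from the paper): a trial is accepted when its
    score is greater than or equal to the decision threshold.
    Returns (eer, threshold) at the candidate threshold where the
    false acceptance and false rejection rates are closest.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([genuine, impostor]))

    # False acceptance rate: fraction of impostor trials accepted.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # False rejection rate: fraction of genuine trials rejected.
    frr = np.array([(genuine < t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    genuine_scores = [0.9, 0.8, 0.3, 0.7]
    impostor_scores = [0.1, 0.2, 0.75, 0.4]
    eer, thr = equal_error_rate(genuine_scores, impostor_scores)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With only a handful of trials the estimate is coarse; on a full evaluation set the curves for the two error rates are much smoother, and interpolating between neighbouring thresholds is a common refinement.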
