Frame pruning for automatic speaker identification

In this paper, we propose a frame selection procedure for text-independent speaker identification. Instead of averaging the frame likelihoods over the whole test utterance, some frames are rejected (pruning) and the final score is computed from the remaining frames. The pruning stage requires a prior frame-level likelihood normalization so that likelihoods can be compared meaningfully across frames. This normalization alone yields a significant performance improvement. The optimal number of frames to prune is learned on a tuning data set for normal and telephone speech. Validation of the pruning procedure on 567 speakers shows a significant improvement on both TIMIT and NTIMIT (up to 30% error rate reduction on TIMIT).
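The scoring scheme described above can be illustrated with a minimal sketch. The abstract does not give the exact normalization or the pruning criterion, so both are assumptions here: each frame's log-likelihood is normalized by the log of the summed likelihoods over all speaker models, and the lowest-scoring frames of each speaker are the ones pruned. The function names and the `n_pruned` parameter are hypothetical.

```python
import math


def normalize_frames(loglik):
    """Normalize frame log-likelihoods across speaker models.

    loglik[s][t] is the log-likelihood of frame t under speaker model s.
    Assumption: each frame is normalized by the summed likelihood over
    all speaker models, which makes scores comparable between frames.
    """
    n_speakers = len(loglik)
    n_frames = len(loglik[0])
    normalized = [[0.0] * n_frames for _ in range(n_speakers)]
    for t in range(n_frames):
        column = [loglik[s][t] for s in range(n_speakers)]
        peak = max(column)
        # Numerically stable log of the summed likelihoods (log-sum-exp).
        log_total = peak + math.log(sum(math.exp(v - peak) for v in column))
        for s in range(n_speakers):
            normalized[s][t] = column[s] - log_total
    return normalized


def pruned_score(frame_scores, n_pruned):
    """Average the frame scores after discarding the n_pruned lowest ones.

    Assumption: the pruned frames are the lowest-scoring ones.
    """
    n_pruned = max(0, min(n_pruned, len(frame_scores) - 1))
    kept = sorted(frame_scores)[n_pruned:]
    return sum(kept) / len(kept)


def identify(loglik, n_pruned):
    """Return the index of the speaker with the highest pruned score."""
    normalized = normalize_frames(loglik)
    scores = [pruned_score(row, n_pruned) for row in normalized]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: speaker 0 fits four frames best but has one outlier frame.
toy_loglik = [
    [-1.0, -1.0, -1.0, -1.0, -20.0],
    [-2.0, -2.0, -2.0, -2.0, -2.0],
]
decision_without_pruning = identify(toy_loglik, n_pruned=0)
decision_with_pruning = identify(toy_loglik, n_pruned=1)
```

In this toy case the single outlier frame dominates the plain average, while pruning one frame per speaker changes the decision. In practice `n_pruned` would be chosen on a tuning set, as the abstract describes.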
