Experiments with speaker verification over the telephone

SUMMARY We have presented a series of experiments in speaker verification for both high-quality speech and telephone speech using a statistical approach based on HMM phone models. The decoding procedure has been efficiently implemented by processing all the models in parallel using a time-synchronous beam search strategy. Speaker verification (or identification) can be carried out in either text-dependent or text-independent mode using the same phone models. For text-independent verification, the phone-based approach was shown to clearly outperform a simpler Gaussian mixture model on high-quality speech from the BREF corpus, and for telephone speech it has a 20% lower a posteriori equal error rate (EER). For the BREF corpus, text-independent and text-dependent verification EERs were about the same. On the telephone corpus, text-dependent verification performs better than text-independent verification. When a verification attempt fails, allowing a second trial reduces the number of errors by 20%, while only increasing the number of trials by 10%. For the telephone speech corpus, the majority of the errors are due to low scores for a few target speakers, mostly reflecting differences in the origin of the call for the training and testing sessions. In an additional experiment on the telephone speech, combining a model for F0 with the speaker-specific phone model set did not significantly improve performance. On the telephone speech corpus, an a posteriori equal error rate of 2.9% was obtained using a minimum duration of 2 s per trial, in text-dependent mode, allowing 2 trials per attempt. This can be contrasted with the equal error rate obtained on the high-quality speech corpus, which is well under 1%.
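The a posteriori equal error rate quoted above is the error rate at the decision threshold, chosen after scoring the test data, where the false acceptance rate and the false rejection rate coincide. The sketch below illustrates that definition only. It is not the authors' evaluation code; the function name `equal_error_rate`, the acceptance rule (accept when score >= threshold), and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) for the threshold where FAR and FRR are closest.

    Assumption: a trial is accepted when its score is >= the threshold.
    The threshold is picked on the same scores it is evaluated on,
    which is what makes the result "a posteriori".
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best = None  # (gap between FAR and FRR, eer estimate, threshold)
    # Only observed scores need to be tried: the error rates change
    # nowhere else along the score axis.
    for t in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a true speaker scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finitely many trials FAR and FRR may never be exactly
            # equal, so report their mean at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Made-up scores, for illustration only (not data from the experiments).
toy_targets = [2.0, 3.0, 4.0, 5.0]
toy_impostors = [0.0, 1.0, 2.5, 1.5]
eer, threshold = equal_error_rate(toy_targets, toy_impostors)
```

Because the threshold is tuned on the test scores themselves, an a posteriori EER is an optimistic figure compared with one obtained from a threshold fixed in advance.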
