A Tutorial on Text-Independent Speaker Verification

This paper presents an overview of a state-of-the-art text-independent speaker verification system. The introduction proposes a modular scheme for the training and test phases of such a system. Cepstral analysis, the speech parameterization most commonly used in speaker verification, is then detailed. Gaussian mixture modeling, the speaker modeling technique used in most systems, is explained next, and a few alternatives, namely neural networks and support vector machines, are mentioned. Score normalization is then explained, as it is a very important step in dealing with real-world data. The evaluation of a speaker verification system is detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are enumerated, including speaker tracking and segmentation by speakers. Some applications of speaker verification are then proposed, including on-site applications, remote applications, applications related to structuring audio information, and games. Issues concerning the forensic area are recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. The paper concludes with a few research trends in speaker verification for the next couple of years.
