Speaker verification in a time-feature space

The goal of this dissertation is to determine the relative importance of components of the modulation spectrum for automatic speaker verification and to use this knowledge to improve the performance of an automatic speaker verification system. It is proposed that the power spectrum of a time sequence of logarithmic energy, called the modulation spectrum, provides information that may be used to reduce the effects of adverse environments. The proposed strategy is to attenuate spectral components that are not particularly useful for speaker verification, with the aim of reducing system sensitivity to telephone handset variability without reducing verification accuracy. By computing the effect of carbon-button and electret microphone transducers on the modulation spectrum of telephone speech, it is found that handset transducer variability accounts for a substantial portion of the total variability at moderate to high modulation frequencies. The same holds at very low modulation frequencies, where the variability is ascribed to the effect of a convolutional channel. This result is substantiated with verification results on the Switchboard corpora as used in the 1997–1998 NIST speaker recognition evaluations. The main conclusion is that components of the modulation spectrum between 0.1 Hz and 10 Hz contain the most useful information for speaker verification. To deal with adverse environments, it is proposed that the time sequences of logarithmic energy be lowpass filtered. Compared with other filtering techniques, such as cepstral mean subtraction, which may retain components up to 50 Hz, or RASTA processing, which retains components between 1 Hz and 13 Hz, lowpass filtering to 10 Hz is found to significantly reduce verification error in conditions where handset transducers differ between training and testing. It is furthermore proposed that the feature stream be downsampled from a 100 Hz sampling rate to as low as 25 Hz after lowpass filtering. Using this processing, a relative reduction in error of about 10% is shown for the 1997 and 1998 NIST speaker recognition evaluations. Additional contributions of the dissertation include the design and implementation of a modular, high-performance speaker recognition toolkit.
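
The processing described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the Hamming-windowed sinc design, the 101-tap filter length, and the function names (lowpass_fir, filter_and_downsample, modulation_spectrum) are assumptions chosen for illustration. The sketch takes one log-energy trajectory sampled at 100 Hz, estimates its modulation spectrum as the power spectrum of that trajectory, lowpass filters it to 10 Hz, and keeps every fourth sample to obtain a 25 Hz feature stream.

```python
import numpy as np


def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc lowpass filter, normalized to unit DC gain.

    The window type and tap count are illustrative assumptions.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    fc = cutoff_hz / fs_hz                     # cutoff in cycles per sample
    h = 2.0 * fc * np.sinc(2.0 * fc * n)       # ideal lowpass impulse response
    h *= np.hamming(num_taps)                  # taper to limit stopband ripple
    return h / h.sum()


def modulation_spectrum(log_energy, fs_hz=100.0):
    """Power spectrum of a log-energy time sequence (one frequency band)."""
    x = np.asarray(log_energy, dtype=float)
    x = x - x.mean()                           # remove the DC component
    windowed = x * np.hanning(len(x))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs_hz)
    return freqs, power


def filter_and_downsample(log_energy, fs_hz=100.0, cutoff_hz=10.0, factor=4):
    """Lowpass filter a log-energy trajectory, then keep every factor-th sample.

    Assumes the trajectory is longer than the filter (101 samples here).
    """
    x = np.asarray(log_energy, dtype=float)
    h = lowpass_fir(cutoff_hz, fs_hz)
    smoothed = np.convolve(x, h, mode="same")  # symmetric taps: no net delay
    return smoothed[::factor], fs_hz / factor


if __name__ == "__main__":
    # Synthetic trajectory: a 4 Hz modulation plus a 30 Hz component
    # standing in for variability outside the useful range.
    fs = 100.0
    t = np.arange(1000) / fs
    trajectory = np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 30 * t)
    features, new_fs = filter_and_downsample(trajectory, fs)
    print(len(features), new_fs)
```

With a 10 Hz cutoff, the filtered stream has little energy above the 12.5 Hz Nyquist frequency of the 25 Hz rate, which is why the downsampling step loses little of the retained modulation content. In a full front end the same filter would be applied independently to each feature trajectory.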
