Robust endpoint detection and energy normalization for real-time speech and speaker recognition

When automatic speech recognition (ASR) and speaker verification (SV) are applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of both systems. In nonstationary environments with a low signal-to-noise ratio (SNR), conventional approaches to endpoint detection and energy normalization often fail, and ASR performance usually degrades dramatically. The purpose of this paper is to address the endpoint problem. For ASR, we propose a real-time approach that uses an optimal filter plus a three-state transition diagram for endpoint detection. The filter is designed using several criteria to ensure accuracy and robustness, and its response is almost invariant across background noise levels. The detected endpoints are then applied sequentially to energy normalization. Evaluation results show that the proposed algorithm significantly reduces string error rates in low-SNR situations; on several of the evaluated databases, the reduction exceeds 50%. For SV, we propose a batch-mode approach that uses the optimal filter plus a two-mixture energy model for endpoint detection. Experiments show that the batch-mode algorithm detects endpoints as accurately as HMM forced alignment at much lower computational complexity.
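
The real-time idea described above, an edge-detection filter on frame energy followed by a three-state transition diagram, can be illustrated with a minimal sketch. The abstract does not give the optimal filter's coefficients, the state names, or the thresholds, so everything here is an assumption made for illustration: the filter is a plain moving-difference filter standing in for the optimal one, and `rise`, `fall`, `hang`, and `half_width` are hypothetical parameters rather than the authors' settings.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state names; the abstract only says "three-state".
    SILENCE = 0
    IN_SPEECH = 1
    LEAVING_SPEECH = 2


def edge_filter(log_energy, half_width=3):
    """Stand-in edge filter (NOT the paper's optimal filter).

    Output at frame t is the mean of the next `half_width` frames minus the
    mean of the previous `half_width` frames. A constant offset in the
    log-energy (a different background level) cancels out.
    """
    n = len(log_energy)
    out = []
    for t in range(n):
        # Clamp indices at the sequence boundaries.
        ahead = sum(log_energy[min(n - 1, t + k)] for k in range(1, half_width + 1))
        behind = sum(log_energy[max(0, t - k)] for k in range(1, half_width + 1))
        out.append((ahead - behind) / half_width)
    return out


def detect_endpoints(log_energy, rise=2.0, fall=-2.0, hang=5, half_width=3):
    """Return a list of (start_frame, end_frame) pairs.

    rise / fall: assumed thresholds on the filter output.
    hang: assumed number of quiet frames required before closing a segment.
    """
    filtered = edge_filter(log_energy, half_width)
    state = State.SILENCE
    segments = []
    start = end = quiet = 0
    for t, value in enumerate(filtered):
        if state is State.SILENCE:
            if value >= rise:  # rising edge: speech begins
                state, start = State.IN_SPEECH, t
        elif state is State.IN_SPEECH:
            if value <= fall:  # falling edge: speech may be ending
                state, end, quiet = State.LEAVING_SPEECH, t, 0
        else:  # State.LEAVING_SPEECH
            if value >= rise:  # energy rose again: it was only a pause
                state = State.IN_SPEECH
            elif value <= fall:  # still falling: push the endpoint later
                end, quiet = t, 0
            else:
                quiet += 1
                if quiet >= hang:  # stayed quiet long enough: close segment
                    segments.append((start, end))
                    state = State.SILENCE
    # Flush a segment that is still open when the input ends.
    if state is State.IN_SPEECH:
        segments.append((start, len(log_energy) - 1))
    elif state is State.LEAVING_SPEECH:
        segments.append((start, end))
    return segments


if __name__ == "__main__":
    # Synthetic log-energy contour: 20 quiet frames, 30 loud frames, 20 quiet.
    contour = [0.0] * 20 + [10.0] * 30 + [0.0] * 20
    print(detect_endpoints(contour))
```

Because the stand-in filter only looks at energy differences, raising the whole contour by a constant leaves the detected endpoints unchanged, which mirrors (in a much cruder way) the near-invariance to background noise level claimed for the optimal filter. The detector only needs `half_width` frames of look-ahead, so it can run frame by frame with a short delay.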
