Long-Term Spectro-Temporal and Static Harmonic Features for Voice Activity Detection

Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a statistical-model-based, noise-robust VAD algorithm that uses long-term temporal information and harmonic-structure-based features of speech. Long-term temporal information has recently become a focus of ASR research, but it has not yet been investigated in depth for VAD. We first consider temporal features in the cepstral domain, calculated over the average phoneme duration. Harmonic structures, for their part, are well-known carriers of acoustic information in human voices, but that information is difficult to exploit statistically. We therefore describe a new method that exploits harmonic structure information with statistical models, providing additional noise robustness. The proposed method, which combines the long-term temporal features and the static harmonic features, led to considerable improvements under low-SNR conditions: an average error reduction of 77.7% compared with the ETSI AFE-VAD in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system.
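The abstract does not spell out how the long-term temporal features are computed. As a rough illustration only, the sketch below computes a regression-slope ("delta") feature over a wide window of cepstral frames, which is one common way to capture temporal change across a span comparable to a phoneme. The function name `long_term_delta`, the 10 ms frame shift, and the half-width of 4 frames are assumptions made for this example, not values taken from the paper.

```python
import numpy as np


def long_term_delta(cepstra: np.ndarray, half_width: int) -> np.ndarray:
    """Regression slope of each cepstral coefficient over a long window.

    cepstra: array of shape (num_frames, num_coeffs).
    half_width: number of frames on each side of the centre frame.
    This is a generic delta feature, assumed here for illustration;
    it is not confirmed to be the paper's exact definition.
    """
    if half_width < 1:
        raise ValueError("half_width must be at least 1")
    num_frames = cepstra.shape[0]
    # Repeat the first and last frames so every frame has a full window.
    padded = np.pad(cepstra, ((half_width, half_width), (0, 0)), mode="edge")
    offsets = np.arange(-half_width, half_width + 1)
    denom = float(np.sum(offsets ** 2))
    delta = np.zeros_like(cepstra, dtype=float)
    for k in offsets:
        start = half_width + k
        # Weight the frame at offset k from the centre by k.
        delta += k * padded[start:start + num_frames]
    return delta / denom


if __name__ == "__main__":
    # Hypothetical input: 100 frames of 13 cepstral coefficients.
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((100, 13))
    # With an assumed 10 ms frame shift, half_width=4 spans 9 frames (90 ms).
    features = long_term_delta(frames, half_width=4)
    print(features.shape)
```

A wider `half_width` smooths the slope estimate over a longer span at the cost of temporal resolution; the right setting depends on the frame shift and on the phoneme duration assumed for the language.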