Long-Term Spectro-Temporal and Static Harmonic Features for Voice Activity Detection

Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a noise-robust VAD algorithm based on statistical models that uses long-term temporal information and harmonic-structure-based features of speech. Long-term temporal information has recently become a focus of ASR research, but it has not yet been investigated in depth for VAD. We first consider temporal features in the cepstral domain, calculated over the average phoneme duration. Harmonic structure, on the other hand, is a well-known carrier of acoustic information in the human voice, but that information is difficult to exploit statistically. We therefore describe a new method for exploiting harmonic structure information with statistical models, which provides additional noise robustness. The proposed method, which combines the long-term temporal features with the static harmonic features, led to considerable improvements under low-SNR conditions, with a 77.7% error reduction on average compared with the ETSI AFE-VAD in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system.
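As an illustration of what a long-term temporal feature in the cepstral domain might look like, the following is a minimal sketch that computes a regression-slope (delta) feature over a wide window of cepstral frames. It is not the method of the paper: the function name `long_term_delta`, the `half_window` parameter, and the use of the standard delta regression formula are all assumptions made for this example, and the window length that corresponds to the average phoneme duration is not stated in the abstract.

```python
import numpy as np


def long_term_delta(cepstra, half_window):
    """Regression-slope (delta) features over +/- half_window frames.

    cepstra: array of shape (num_frames, num_coeffs).
    half_window: hypothetical parameter; a larger value gives a longer
    temporal context (the paper's actual window is not given here).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]
    # Repeat the edge frames so every frame has a full window.
    padded = np.pad(cepstra, ((half_window, half_window), (0, 0)), mode="edge")
    # Normalizer of the standard delta regression formula.
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    delta = np.zeros_like(cepstra)
    for k in range(1, half_window + 1):
        ahead = padded[half_window + k : half_window + k + num_frames]
        behind = padded[half_window - k : half_window - k + num_frames]
        delta += k * (ahead - behind)
    return delta / denom


if __name__ == "__main__":
    # Synthetic input: 50 frames of 13 random "cepstral" coefficients.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(50, 13))
    features = long_term_delta(frames, half_window=5)
    print(features.shape)
```

For a sequence whose coefficients change linearly over time, this estimator returns the slope exactly at frames far enough from the edges, which is a convenient sanity check for the window arithmetic.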
