Voice Activity Detection Algorithm Using Zero Frequency Filter Assisted Peaking Resonator and Empirical Mode Decomposition

Abstract In this article, a new adaptive, data-driven strategy for voice activity detection (VAD) based on empirical mode decomposition (EMD) is proposed. Speech data are decomposed in the time domain using EMD, an a posteriori, adaptive, data-driven method, to yield a set of physically meaningful intrinsic mode functions (IMFs). Each IMF preserves the nonlinear and nonstationary properties of the speech utterance. Among the IMFs, the one that predominantly contains source information, called the characteristic IMF (CIMF), is identified and extracted using a zero-frequency filter-assisted peaking resonator. The energy of the detected CIMF is computed using short-term processing, and voiced regions in the speech utterance are detected by comparing the frame energy against a suitably chosen threshold. The proposed framework has been studied on both clean and noisy speech utterances (0-dB white noise) and shows encouraging results in the presence of white noise up to 0 dB.
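
The final stage described in the abstract, short-term energy followed by thresholding, can be illustrated with a minimal sketch. This is not the authors' implementation: the EMD and the zero-frequency filter-assisted peaking resonator are not reproduced here, and the input is simply assumed to be an already extracted CIMF. The frame length, hop size, relative threshold, and the names `frame_energies` and `detect_voiced_frames` are all hypothetical choices made for illustration.

```python
import math


def frame_energies(signal, frame_len, hop):
    """Short-term energy: sum of squared samples in each full frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def detect_voiced_frames(signal, frame_len=160, hop=80, rel_threshold=0.1):
    """Mark a frame as voiced when its energy exceeds a fraction of the peak.

    The relative threshold (10% of the maximum frame energy) is an assumed
    rule for this sketch; the abstract only says a proper threshold is chosen.
    """
    energies = frame_energies(signal, frame_len, hop)
    if not energies:
        return []
    threshold = rel_threshold * max(energies)
    return [e > threshold for e in energies]


# Toy stand-in for a CIMF: low-level segment, strong 100 Hz tone, low-level segment.
fs = 8000  # assumed sampling rate in Hz
quiet = [0.01 * math.sin(2 * math.pi * 50 * n / fs) for n in range(800)]
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
toy_cimf = quiet + tone + quiet

decisions = detect_voiced_frames(toy_cimf)
voiced_frames = [i for i, voiced in enumerate(decisions) if voiced]
print(voiced_frames)
```

With a 20 ms frame and 10 ms hop at the assumed 8 kHz rate, only the frames overlapping the high-energy middle segment are flagged. On real speech the threshold would need tuning to the noise level.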
