Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability

Voice activity detection (VAD) is an important pre-processing step that is widely used in speech-based systems. This paper proposes a robust VAD algorithm that uses two features, the sub-band temporal envelope and the sub-band long-term signal variability, to distinguish speech from non-speech, including both stationary and non-stationary noise. The two features are combined through a fusion decision to produce a robust VAD decision. The proposed algorithm is unsupervised and low-complexity, and it operates without pre-trained models. Experimental results show that it outperforms the baseline algorithms and handles a variety of noise environments over a wide range of signal-to-noise ratios. The proposed algorithm can be applied to speech-based systems.
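The abstract does not give formulas, so the following is only a minimal sketch of one of the two features, assuming the long-term signal variability measure as it is commonly defined in the VAD literature: for each frequency bin, compute the entropy of the normalized power over the last R frames, then take the variance of those entropies across bins. The function name `ltsv`, the `eps` guard, and the toy inputs are illustrative assumptions, not the authors' implementation. A sub-band variant could apply the same function to the bins of each band separately.

```python
import math

def ltsv(power_frames, eps=1e-12):
    """Long-term signal variability over a window of R frames (hypothetical helper).

    power_frames: list of R frames, each a list of K non-negative power values.
    Returns the variance across bins of the per-bin temporal entropy.
    """
    num_bins = len(power_frames[0])
    entropies = []
    for k in range(num_bins):
        # Power trajectory of bin k over the window; eps avoids log(0).
        track = [frame[k] + eps for frame in power_frames]
        total = sum(track)
        # Entropy of the trajectory normalized to sum to 1 over time.
        entropy = -sum((p / total) * math.log(p / total) for p in track)
        entropies.append(entropy)
    mean_entropy = sum(entropies) / num_bins
    # Variance of the entropies across frequency bins.
    return sum((e - mean_entropy) ** 2 for e in entropies) / num_bins

# Toy inputs (assumed for illustration): 8 frames, a few bins each.
stationary = [[1.0, 2.0, 3.0] for _ in range(8)]
bursty = [[1.0, 10.0 if n % 2 == 0 else 0.01] for n in range(8)]
```

For a stationary spectrum every bin has nearly maximal entropy, so the measure is close to zero. When the power in some bins fluctuates over time, as in speech, the entropies differ across bins and the measure grows. A detector would compare this value against a threshold, which the abstract does not specify.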
