Voice Activity Detection: Merging Source and Filter-based Information

Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise. Numerous approaches have been proposed for this purpose: some are based on features derived from the power spectral density, while others exploit the periodicity of the signal. The goal of this letter is to investigate the joint use of source- and filter-based features. Interestingly, a mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones. These features are then fed to an artificial neural network-based classifier trained on a multi-condition database. Two strategies are proposed to merge source and filter information: feature fusion and decision fusion. Our experiments indicate an absolute reduction of 3% in equal error rate when using decision fusion. The final proposed system is compared to four state-of-the-art methods on 150 minutes of data recorded in real environments. Thanks to the robustness of its source-related features, its multi-condition training and its efficient information fusion, the proposed system achieves a substantial increase in accuracy over the best state-of-the-art VAD across all conditions (24% absolute on average).
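
To make the two fusion strategies concrete, the following is a minimal sketch of feature fusion versus decision fusion for frame-level speech/non-speech classification. It uses synthetic feature matrices standing in for the source-related and filter-based features described in the letter; the library choice (scikit-learn MLPs), feature dimensions, hyper-parameters, and the evaluation on training data are illustrative assumptions and do not reproduce the letter's actual system.

```python
# Sketch of feature fusion vs. decision fusion for VAD.
# All features are synthetic placeholders; hyper-parameters are arbitrary.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, n)                              # 1 = speech, 0 = noise
source_feats = rng.normal(labels[:, None], 1.0, (n, 4))     # stand-in for source/excitation cues
filter_feats = rng.normal(labels[:, None], 1.5, (n, 13))    # stand-in for spectral-envelope cues

def equal_error_rate(y_true, scores):
    """EER: operating point where false-acceptance and false-rejection rates meet."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Feature fusion: concatenate both feature sets and train a single classifier.
fused = np.hstack([source_feats, filter_feats])
clf_fused = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(fused, labels)
scores_feature_fusion = clf_fused.predict_proba(fused)[:, 1]

# Decision fusion: one classifier per feature set, then combine the posteriors
# (a simple average here; other combination rules are possible).
clf_src = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(source_feats, labels)
clf_flt = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(filter_feats, labels)
scores_decision_fusion = 0.5 * (clf_src.predict_proba(source_feats)[:, 1]
                                + clf_flt.predict_proba(filter_feats)[:, 1])

# Scores are computed on the training data purely for brevity; a real
# evaluation would use held-out multi-condition data.
print("EER (feature fusion):  %.3f" % equal_error_rate(labels, scores_feature_fusion))
print("EER (decision fusion): %.3f" % equal_error_rate(labels, scores_decision_fusion))
```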
