Voice activity detection using subband noncircularity

Many voice activity detection (VAD) systems use only the magnitude of complex-valued spectral representations. However, the magnitude alone often does not fully characterize the statistical behavior of the complex values. We present two novel methods for performing VAD on single- and dual-channel audio that fully account for the second-order statistical behavior of complex data. Our methods exploit the second-order noncircularity (also known as impropriety) of complex subbands of speech and noise. Since speech tends to be more improper than noise, higher impropriety suggests speech activity. Our single-channel method is blind in the sense that it is unsupervised and, unlike many VAD systems, does not rely on non-speech periods for noise parameter estimation. Our methods outperform state-of-the-art magnitude-based VADs on the QUT-NOISE-TIMIT corpus, which indicates that impropriety is a compelling new feature for voice activity detection.
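To make the notion of second-order noncircularity concrete, the following minimal sketch estimates the standard circularity coefficient |E[z^2]| / E[|z|^2] of a set of complex samples. It is an illustration only, not the single- or dual-channel detector described above: the function name `circularity_coefficient` and the synthetic Gaussian test data are assumptions introduced here, and a real system would apply such a measure to complex subband samples over short frames.

```python
import random


def circularity_coefficient(samples):
    """Estimate |E[z^2]| / E[|z|^2] for complex samples.

    Values near 0 indicate a proper (second-order circular) signal;
    values near 1 indicate a strongly improper one.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [z - mean for z in samples]
    # Ordinary variance E[|z|^2] and complementary variance E[z^2].
    variance = sum(abs(z) ** 2 for z in centered) / n
    complementary = sum(z * z for z in centered) / n
    if variance == 0.0:
        return 0.0
    return abs(complementary) / variance


# Hypothetical synthetic data, used only to exercise the estimator.
rng = random.Random(0)

# Proper: real and imaginary parts are independent with equal power.
proper = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

# Improper: the imaginary part carries much less power than the real part.
improper = [complex(rng.gauss(0, 1), 0.2 * rng.gauss(0, 1)) for _ in range(20000)]

print("proper:  ", circularity_coefficient(proper))
print("improper:", circularity_coefficient(improper))
```

Under these assumptions the first estimate should come out close to zero and the second close to one. A detector in the spirit of the abstract would treat a higher coefficient in a subband as evidence of speech, although the decision rule itself is not specified here.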
