Voice Activity Detection in Presence of Transient Noise Using Spectral Clustering

Voice activity detection has attracted significant research efforts in the last two decades. Despite much progress in designing voice activity detectors, voice activity detection (VAD) in presence of transient noise is a challenging problem. In this paper, we develop a novel VAD algorithm based on spectral clustering methods. We propose a VAD technique which is a supervised learning algorithm. This algorithm divides the input signal into two separate clusters (i.e., speech presence and speech absence frames). We use labeled data in order to adjust the parameters of the kernel used in spectral clustering methods for computing the similarity matrix. The parameters obtained in the training stage together with the eigenvectors of the normalized Laplacian of the similarity matrix and Gaussian mixture model (GMM) are utilized to compute the likelihood ratio needed for voice activity detection. Simulation results demonstrate the advantage of the proposed method compared to conventional statistical model-based VAD algorithms in presence of transient noise.

[1]  Joon-Hyuk Chang,et al.  Voice activity detection based on complex Laplacian model , 2003 .

[2]  Mikhail Belkin,et al.  Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , 2003, Neural Computation.

[3]  David L. Donoho,et al.  Image Manifolds which are Isometric to Euclidean Space , 2005, Journal of Mathematical Imaging and Vision.

[4]  Israel Cohen,et al.  Transient Noise Reduction Using Nonlocal Diffusion Filters , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Arnaud Martin,et al.  Towards improving speech detection robustness for speech recognition in adverse conditions , 2003, Speech Commun..

[6]  I. Cohen,et al.  AR-GARCH in Presence of Noise: Parameter Estimation and Its Application to Voice Activity Detection , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Robert H. Halstead,et al.  Matrix Computations , 2011, Encyclopedia of Parallel Computing.

[8]  H.S. Jamadagni,et al.  VAD techniques for real-time speech transmission on the Internet , 2002, 5th IEEE International Conference on High Speed Networks and Multimedia Communication (Cat. No.02EX612).

[9]  Douglas A. Reynolds,et al.  Robust text-independent speaker identification using Gaussian mixture speaker models , 1995, IEEE Trans. Speech Audio Process..

[10]  Israel Cohen,et al.  Dominant speaker identification for multipoint videoconferencing , 2012, 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel.

[11]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Kumar Swaminathan,et al.  Noise reduction and echo cancellation front-end for speech codecs , 2003, IEEE Trans. Speech Audio Process..

[13]  Haim H. Permuter,et al.  Gaussian mixture models of texture and colour for image database retrieval , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[14]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[15]  Beng Chin Ooi,et al.  Indexing the Distance: An Efficient Method to KNN Processing , 2001, VLDB.

[16]  Masakiyo Fujimoto,et al.  A tandem connectionist model using combination of multi-scale spectro-temporal features for acoustic event detection , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Ronald R. Coifman,et al.  Parametrization of Linear Systems Using Diffusion Kernels , 2012, IEEE Transactions on Signal Processing.

[18]  Ronald R. Coifman,et al.  Graph Laplacian Tomography From Unknown Random Projections , 2008, IEEE Transactions on Image Processing.

[19]  Masakiyo Fujimoto,et al.  Noise robust voice activity detection based on periodic to aperiodic component ratio , 2010, Speech Commun..

[20]  Ephraim Speech enhancement using a minimum mean square error short-time spectral amplitude estimator , 1984 .

[21]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[22]  Javier Ramírez,et al.  Statistical voice activity detection using a multiple observation likelihood ratio test , 2005, IEEE Signal Processing Letters.

[23]  Matthias Hein,et al.  Intrinsic dimensionality estimation of submanifolds in Rd , 2005, ICML.

[24]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[25]  Stéphane Lafon,et al.  Diffusion maps , 2006 .

[26]  Ming Liu,et al.  HMM-Based Acoustic Event Detection with AdaBoost Feature Selection , 2007, CLEAR.

[27]  E. Shlomot,et al.  ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications , 1997, IEEE Commun. Mag..

[28]  Joon-Hyuk Chang,et al.  Voice activity detection based on generalized gamma distribution , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[29]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[30]  Hamid Sheikhzadeh,et al.  ETSI AMR-2 VAD: evaluation and ultra low-resource implementation , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[31]  Sven Nordholm,et al.  Statistical Voice Activity Detection Using Low-Variance Spectrum Estimation and an Adaptive Threshold , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[32]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[33]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[34]  Mark Hasegawa-Johnson,et al.  Improving acoustic event detection using generalizable visual features and multi-modality modeling , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Jonathan G. Fiscus,et al.  Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8-11, 2007, Revised Selected Papers , 2008, CLEAR.

[36]  P. Niyogi,et al.  A Geometric Perspective on Speech Sounds , 2005 .

[37]  Israel Cohen,et al.  Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging , 2003, IEEE Trans. Speech Audio Process..

[38]  Javier Ramírez,et al.  A new adaptive long-term spectral estimation voice activity detector , 2003, INTERSPEECH.

[39]  Matthias Hein,et al.  Intrinsic Dimensionality Estimation of Submanifolds in Euclidean space , 2005, ICML 2005.

[40]  Michael I. Jordan,et al.  Learning Spectral Clustering, With Application To Speech Separation , 2006, J. Mach. Learn. Res..

[41]  Marina Meila,et al.  Spectral Clustering of Biological Sequence Data , 2005, AAAI.

[42]  Wonyong Sung,et al.  A statistical model-based voice activity detection , 1999, IEEE Signal Processing Letters.