A novel voice activity detection based on phoneme recognition using statistical model

This article proposes a novel voice activity detection (VAD) approach based on phoneme recognition, using hidden Markov models with Gaussian mixture model emission densities (HMM/GMM). Each speech/non-speech segment is represented by several speech features: high-order statistics (HOS), harmonic structure information, and Mel-frequency cepstral coefficients (MFCCs). The main idea is to treat non-speech as an additional phoneme alongside the conventional phonemes of Mandarin; all of these HMM/GMM models are then trained under the maximum likelihood principle with the Baum-Welch algorithm. Finally, the Viterbi decoding algorithm is used to search for the maximum-likelihood state sequence given the observed signal. The proposed method achieves higher speech/non-speech detection accuracy than some existing VAD methods over a wide range of signal-to-noise ratios (SNRs). We also propose a different method to demonstrate that conventional speech enhancement that relies only on accurate VAD is not effective enough for automatic speech recognition (ASR) at low SNRs.
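
The decoding step described above can be illustrated with a minimal sketch. It is not the authors' system: it assumes only two states (non-speech and speech) instead of a full set of phoneme models, and the per-frame likelihoods, transition probabilities, and the function name `viterbi` are hypothetical placeholders for the scores that trained GMMs would provide.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence for a sequence of frames.

    log_init[s]     : log prior of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    """
    n_states = len(log_init)
    # Best log score of any path that ends in each state at the first frame.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []  # backptr[t][s] = best predecessor of state s at frame t + 1

    for frame in log_emit[1:]:
        prev, new_score = [], []
        for s in range(n_states):
            best_r = max(range(n_states),
                         key=lambda r: score[r] + log_trans[r][s])
            prev.append(best_r)
            new_score.append(score[best_r] + log_trans[best_r][s] + frame[s])
        backptr.append(prev)
        score = new_score

    # Trace back from the best final state to recover the full path.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for prev in reversed(backptr):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical values, chosen only for illustration.
# State 0 = non-speech, state 1 = speech.
init = [0.5, 0.5]
trans = [[0.9, 0.1],   # non-speech tends to stay non-speech
         [0.1, 0.9]]   # speech tends to stay speech
# Per-frame likelihoods (non-speech, speech); frame 3 weakly favors non-speech.
emit = [(0.8, 0.2), (0.8, 0.2), (0.2, 0.8),
        (0.6, 0.4), (0.1, 0.9), (0.2, 0.8)]

log_init = [math.log(p) for p in init]
log_trans = [[math.log(p) for p in row] for row in trans]
log_emit = [[math.log(p) for p in frame] for frame in emit]

labels = viterbi(log_init, log_trans, log_emit)
print(labels)  # [0, 0, 1, 1, 1, 1]
```

A frame-by-frame decision on the same scores would label frame 3 as non-speech. The decoded path keeps it as speech because the transition probabilities penalize a brief switch, which is one reason sequence decoding can be more robust than independent per-frame decisions.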
