Speaker Verification Under Degraded Conditions Using Empirical Mode Decomposition Based Voice Activity Detection Algorithm

Abstract: The performance of most state-of-the-art speaker recognition (SR) systems deteriorates under degraded conditions because of mismatch between the training and testing sessions. This study focuses on the front end of the speaker verification (SV) system to reduce that mismatch. An adaptive voice activity detection (VAD) algorithm using a zero-frequency filter assisted peaking resonator (ZFFPR) was integrated into the front end of the SV system. The performance of the proposed SV system was studied under degraded conditions with 50 selected speakers from the NIST 2003 database. Degraded conditions were simulated by adding different types of noise from the NOISEX-92 database to the original speech utterances at signal-to-noise ratio (SNR) levels from 0 to 20 dB. The features were the widely used 39-dimensional Mel frequency cepstral coefficients (MFCCs; 13 static coefficients augmented with 13 velocity and 13 acceleration coefficients), and a Gaussian mixture model–universal background model (GMM-UBM) was used for speaker modeling. The proposed system was compared against an SV system that uses an energy-based VAD at its front end. The proposed SV system showed encouraging results when the empirical mode decomposition (EMD)-based VAD was used at its front end.
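
The two simplest pieces of the setup described above, mixing noise into clean speech at a chosen signal-to-noise ratio (SNR) and an energy-based VAD of the kind used for comparison, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the non-overlapping framing, the 160-sample frame length, and the threshold rule (a fraction of the mean frame energy) are all assumptions made for illustration.

```python
import math


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB, then add it.

    Assumes `noise` is at least as long as `speech` and has non-zero power.
    """
    noise = noise[:len(speech)]
    # Choose gain so that power(speech) / power(gain * noise) = 10^(snr_db / 10).
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def energy_vad(signal, frame_len=160, threshold_ratio=0.5):
    """Return one boolean per non-overlapping frame: True means 'speech'.

    Hypothetical decision rule: a frame counts as speech when its energy
    exceeds `threshold_ratio` times the mean frame energy of the utterance.
    """
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    energies = [power(signal[i:i + frame_len]) for i in starts]
    threshold = threshold_ratio * sum(energies) / len(energies)
    return [e > threshold for e in energies]
```

Calling add_noise_at_snr(speech, noise, snr_db) for several values of snr_db between 0 and 20 would produce a set of degraded test utterances, and energy_vad would then select the frames passed on to feature extraction. A fixed global threshold like this one tends to admit noise-only frames as the SNR falls, which is the kind of weakness that motivates an adaptive VAD.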
