Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification

Abstract

Adverse noisy conditions pose great challenges to automatic speech applications including speaker and language identification (SID and LID), where mel-frequency cepstral coefficients (MFCC) are the most commonly adopted acoustic features. Although systems trained using MFCCs provide competitive performance under matched conditions, it is well known that such systems are susceptible to acoustic mismatch between training and test conditions due to noise and channel degradations. Motivated by this fact, this study proposes an alternative noise-robust acoustic feature front-end that is capable of capturing speaker identity as well as the language structure and content conveyed in the speech signal. Specifically, a feature extraction procedure inspired by human auditory processing is proposed. The proposed feature is based on the Hilbert envelope of Gammatone filterbank outputs, which represent the envelope of the auditory nerve response. The subband amplitude modulations, which are captured through smoothed Hilbert envelopes (also known as temporal envelopes), carry useful acoustic information and have been shown to be robust to signal degradations. The effectiveness of the proposed front-end, termed mean Hilbert envelope coefficients (MHEC), is evaluated in the context of SID and LID tasks using degraded speech material from the DARPA Robust Automatic Transcription of Speech (RATS) program. In addition, we investigate how the dynamic range compression stage in the MHEC feature extraction process affects performance, comparing logarithmic and power-law nonlinearities. Experimental results indicate that: (i) the MHEC feature is highly effective and performs favorably compared to other conventional and state-of-the-art front-ends, and (ii) the power-law nonlinearity consistently yields the best performance across different conditions for both SID and LID tasks.
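The processing chain named in the abstract (Gammatone filterbank, Hilbert envelope, averaging, dynamic range compression) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give frame sizes, filter parameters, the smoothing filter, or the compression exponent, so everything below is an assumption chosen for illustration. In particular, the 4th-order gammatone with Glasberg-Moore bandwidths, the 25 ms / 10 ms Hamming-weighted frame average (which stands in for envelope smoothing), the 1/15 exponent, the final DCT, and all function names (`gammatone_ir`, `mean_envelopes`, `mhec_like`) are hypothetical.

```python
import numpy as np


def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_ir(fc, fs, order=4, duration=0.05):
    """Unit-energy impulse response of a gammatone filter centred at fc (assumed design)."""
    t = np.arange(int(round(duration * fs))) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb(fc) * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))


def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed by zeroing negative frequencies."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))


def mean_envelopes(x, fs, centers, frame_len=0.025, hop=0.010):
    """Hamming-weighted mean of each subband envelope per frame (frames x channels)."""
    flen, fhop = int(round(frame_len * fs)), int(round(hop * fs))
    n_frames = 1 + (len(x) - flen) // fhop
    win = np.hamming(flen)
    win = win / win.sum()
    out = np.empty((n_frames, len(centers)))
    for j, fc in enumerate(centers):
        env = hilbert_envelope(np.convolve(x, gammatone_ir(fc, fs), mode="same"))
        for i in range(n_frames):
            out[i, j] = np.dot(win, env[i * fhop:i * fhop + flen])
    return out


def compress(e, kind="power", exponent=1.0 / 15.0, floor=1e-10):
    """Dynamic range compression: natural log or power law (exponent is an assumption)."""
    e = np.maximum(e, floor)
    return np.log(e) if kind == "log" else e ** exponent


def dct_ii(rows, n_out):
    """Unnormalised DCT-II along the channel axis, keeping the first n_out terms."""
    n = rows.shape[1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    return rows @ np.cos(np.pi * k * (m + 0.5) / n).T


def mhec_like(x, fs, centers, n_ceps, kind="power"):
    """Envelope-based cepstral features in the spirit of the abstract (a sketch only)."""
    return dct_ii(compress(mean_envelopes(x, fs, centers), kind), n_ceps)


# Toy usage: a 1 kHz tone should excite the 1 kHz channel most strongly.
fs = 8000
t = np.arange(fs // 2) / fs
tone = np.cos(2 * np.pi * 1000 * t)
centers = [500.0, 1000.0, 2000.0]
feats = mhec_like(tone, fs, centers, n_ceps=2)
```

Switching `kind` between `"log"` and `"power"` changes only the compression stage, which is the comparison the abstract reports. A realistic front-end would use many more channels and an explicit low-pass smoothing of the envelopes.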
