Robust front-end processing for speaker identification over extremely degraded communication channels

Effective front-end processing, which often involves feature extraction and speech activity detection (SAD), is essential for robustness in speech systems. In this study, we propose an unsupervised SAD scheme that combines four different speech voicing measures with a perceptual spectral flux feature. The proposed scheme is evaluated and compared against several commonly adopted unsupervised SAD methods under actual harsh acoustic conditions. As an example application, we also evaluate the proposed SAD within an i-vector based speaker identification (SID) system, where the recently introduced mean Hilbert envelope coefficients (MHECs) are benchmarked against conventional mel-frequency cepstral coefficients (MFCCs). Long, spontaneous conversational audio recordings from the DARPA RATS program (Phase I) are used in our evaluations. Experimental results indicate that the proposed SAD solution is highly effective and outperforms the other unsupervised SAD techniques considered. In addition, MHECs are shown to be effective alternatives to MFCCs for SID tasks under severely degraded channel conditions.
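The idea of combining a voicing measure with spectral flux for unsupervised frame-level SAD can be illustrated with a small sketch. This is not the scheme proposed in the study: it uses a single stand-in voicing measure (the peak of the normalized autocorrelation) instead of four, a plain rather than perceptual spectral flux, and an arbitrary zero threshold on the combined score. The function names, frame sizes, lag range, the choice to subtract the flux term, and the synthetic test signal are all assumptions made for illustration.

```python
import cmath
import math
import random


def split_frames(signal, size, hop):
    """Split a list of samples into overlapping frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def voicing(frame, min_lag, max_lag):
    """Peak normalized autocorrelation over a pitch-lag range (stand-in voicing measure)."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


def magnitude_spectrum(frame):
    """Unit-sum DFT magnitude spectrum (direct DFT, fine for short frames)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(abs(acc))
    total = sum(mags) or 1.0
    return [m / total for m in mags]


def spectral_flux(frame_list):
    """L2 distance between consecutive normalized spectra; 0.0 for the first frame."""
    spectra = [magnitude_spectrum(f) for f in frame_list]
    flux = [0.0]
    for prev, cur in zip(spectra, spectra[1:]):
        flux.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(cur, prev))))
    return flux


def zscore(values):
    """Zero-mean, unit-variance normalization over the whole recording."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]


def detect_speech(signal, size=200, hop=80, min_lag=20, max_lag=100):
    """Label each frame True (speech-like) or False, with no training data."""
    frame_list = split_frames(signal, size, hop)
    v = zscore([voicing(f, min_lag, max_lag) for f in frame_list])
    f = zscore(spectral_flux(frame_list))
    # Assumption of this sketch: strong periodicity and a steady spectrum
    # both count as evidence for speech, so the flux term is subtracted.
    return [vi - fi > 0.0 for vi, fi in zip(v, f)]


# Synthetic demo at an assumed 8 kHz rate: 0.2 s of white noise, then 0.2 s of a 200 Hz tone.
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(1600)]
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(1600)]
labels = detect_speech(noise + tone)
```

On this toy signal the noise-only frames come out as non-speech and the tone-only frames as speech-like, while frames straddling the boundary may go either way. Real degraded-channel audio would need more robust features and a data-driven threshold.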
