CASA-Based Robust Speaker Identification

Conventional speaker recognition systems perform poorly under noisy conditions. Inspired by auditory perception, computational auditory scene analysis (CASA) typically segregates speech by producing a binary time-frequency mask. We investigate CASA for robust speaker identification. We first introduce a novel speaker feature, the gammatone frequency cepstral coefficient (GFCC), based on an auditory periphery model, and show that this feature captures speaker characteristics and performs substantially better than conventional speaker features under noisy conditions. To deal with noisy speech, we apply CASA separation and then either reconstruct or marginalize the corrupted components indicated by the CASA mask. We find that both reconstruction and marginalization are effective. We further combine the two methods into a single system that exploits their complementary advantages, and this system achieves significant performance improvements over related systems across a wide range of signal-to-noise ratios.
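The marginalization idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' system. It assumes each speaker is modeled by a single diagonal-covariance Gaussian (a simplification of a mixture model), that features are per-channel spectral values so a binary mask applies to each dimension, and that the mask is already given. The names `log_gauss_diag_marginal` and `identify`, and all the numbers, are hypothetical.

```python
import math

def log_gauss_diag_marginal(x, mask, mean, var):
    """Log-likelihood of one frame under a diagonal Gaussian, using only
    the dimensions the binary mask marks as reliable (mask value 1)."""
    total = 0.0
    for xd, md, mu, v in zip(x, mask, mean, var):
        if md:
            # Reliable dimension: add its 1-D Gaussian log-density.
            total += -0.5 * (math.log(2.0 * math.pi * v) + (xd - mu) ** 2 / v)
        # Unreliable dimension: integrated out, so it contributes 0 in the log domain.
    return total

def identify(frames, masks, speakers):
    """Score every speaker by summing marginal log-likelihoods over frames
    and return (best_speaker, scores)."""
    scores = {}
    for name, (mean, var) in speakers.items():
        scores[name] = sum(
            log_gauss_diag_marginal(x, m, mean, var)
            for x, m in zip(frames, masks)
        )
    return max(scores, key=scores.get), scores

# Toy data (assumed): two speaker models over three frequency channels.
speakers = {
    "A": ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
    "B": ([3.0, 3.0, 3.0], [1.0, 1.0, 1.0]),
}
# One frame from speaker A whose middle channel is dominated by noise.
frames = [[0.1, 9.0, -0.2]]

best_masked, _ = identify(frames, [[1, 0, 1]], speakers)    # corrupted channel ignored
best_unmasked, _ = identify(frames, [[1, 1, 1]], speakers)  # all channels trusted
```

In this toy case the masked score picks speaker A, while trusting every channel lets the noisy component pull the decision toward speaker B. Reconstruction, the alternative named in the abstract, would instead replace the masked value with an estimate before scoring.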
