Robust Speaker Identification Based on Binaural Masks

Abstract The performance of far-field speaker identification (SI) systems is usually degraded by the well-known mismatch problem caused by environmental conditions. Speech enhancement methods are a convenient way of resolving the mismatches created by additive noise and reverberation. The human auditory capability for identifying and segregating speakers in complex acoustic environments has motivated researchers to employ known aspects of binaural hearing in speech separation and enhancement methods. This paper proposes a solution to the mismatch problem that uses binaural speech separation as front-end processing for i-vector-based speaker identification. The speech separation approaches use binaural masks to enhance mixture signals in realistic environmental conditions and thereby improve the performance of the SI system. For this purpose, two binaural masks, namely the model-based expectation-maximization interaural coherence mask (MEICM) and a recently introduced DNN-based mask, are employed in the proposed SI structure. To evaluate the new binaural SI structure, an experiment is conducted that examines various ratio masks in i-vector-based speaker identification under diffuse multi-talker babble noise and reverberation. The simulation results show that employing the DNN-based ratio mask in the binaural speech separation front-end achieves the highest identification performance among the mask estimation methods compared.
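As a rough illustration of what a ratio mask does in such a front-end, the sketch below computes an ideal ratio mask from known speech and noise power and applies it to the time-frequency representations of the left and right ear signals. This is not the MEICM or the DNN-based mask of the paper: the ideal ratio mask is a stand-in that requires oracle access to the clean components, and the function names, the toy array sizes, and the choice of sharing one mask across both ears are assumptions made only for this example.

```python
import numpy as np


def ideal_ratio_mask(speech_power, noise_power, eps=1e-12):
    """Ratio mask in [0, 1]: share of each T-F unit's energy due to target speech.

    Oracle (ideal) mask used for illustration only; a practical system
    would estimate the mask from the mixture.
    """
    return speech_power / (speech_power + noise_power + eps)


def apply_mask(left_tf, right_tf, mask):
    """Apply one shared mask to both ear signals (an assumption of this sketch)."""
    return mask * left_tf, mask * right_tf


# Toy complex T-F representations (frequency bins x frames), hypothetical sizes.
rng = np.random.default_rng(0)
shape = (4, 5)
speech = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

left_mix = speech + noise
right_mix = 0.8 * speech + noise  # crude level difference between the ears

mask = ideal_ratio_mask(np.abs(speech) ** 2, np.abs(noise) ** 2)
enh_left, enh_right = apply_mask(left_mix, right_mix, mask)
```

Because the mask values lie between 0 and 1, each time-frequency unit of the mixture is attenuated in proportion to how much of its energy is attributed to noise; the enhanced signals would then be passed to feature extraction for the SI back-end.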
