A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation

We present a new architecture to address the challenges of speaker identification that arise when humans interact with social robots. Although deep learning systems have achieved impressive performance in many speech applications, limited speech data at the training stage and short, noisy utterances at the test stage remain open problems for which no optimal solution has been reported to date. The proposed design employs a generative model, the Gaussian mixture model (GMM), and a discriminative model, the support vector machine (SVM), together with prosodic features and short-term spectral features, to classify both a speaker's gender and his or her identity. The architecture works in a semi-sequential manner in two stages: the first classifier uses the prosodic features to determine the speaker's gender, which is then used together with the short-term spectral features as input to the second classifier system in order to identify the speaker. The second classifier system uses two types of short-term spectral features, namely mel-frequency cepstral coefficients (MFCC) and gammatone frequency cepstral coefficients (GFCC), along with the gender information, as inputs to two different classifiers (GMM and GMM supervector-based SVM), which yields four classifiers in total. The outputs of the second-stage classifiers, namely the GMM-MFCC maximum likelihood classifier (MLC), GMM-GFCC MLC, GMM-MFCC supervector SVM, and GMM-GFCC supervector SVM, are fused at score level by the weighted Borda count approach. The weight factors are computed on the fly by a Mamdani fuzzy inference system whose inputs are the signal-to-noise ratio and the length of the utterance.
Experimental evaluations suggest that the proposed architecture and fusion framework are promising and can improve recognition performance in challenging environments where the signal-to-noise ratio is low and the utterance is short; such scenarios often arise when social robots interact with humans.
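
The score-level fusion step can be illustrated with a minimal sketch of a weighted Borda count. This is not the authors' implementation: the function name `weighted_borda_fusion`, the speaker labels, the scores, and the weights are all hypothetical. In particular, the weights are fixed placeholders here, whereas the abstract states that they are produced on the fly by a Mamdani fuzzy inference system from the signal-to-noise ratio and the utterance length. The sketch assumes that each classifier scores the same set of candidate speakers and that a higher score means a better match.

```python
from typing import Dict, List, Tuple


def weighted_borda_fusion(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Fuse per-classifier speaker scores with a weighted Borda count.

    scores:  classifier name -> {speaker -> score}, higher score = better match
    weights: classifier name -> weight (placeholder values in this sketch)
    Returns (speaker, fused points) pairs sorted from best to worst.
    """
    totals: Dict[str, float] = {}
    for clf, speaker_scores in scores.items():
        # Rank the candidates of this classifier from best to worst.
        ranked = sorted(speaker_scores, key=speaker_scores.get, reverse=True)
        n = len(ranked)
        for position, speaker in enumerate(ranked):
            # Borda points: n-1 for the top rank, down to 0 for the last rank,
            # scaled by the weight assigned to this classifier.
            points = weights[clf] * (n - 1 - position)
            totals[speaker] = totals.get(speaker, 0.0) + points
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three enrolled speakers from the four classifiers.
scores = {
    "gmm_mfcc_mlc": {"A": -41.2, "B": -39.8, "C": -45.0},  # log-likelihoods
    "gmm_gfcc_mlc": {"A": -38.1, "B": -40.3, "C": -44.7},  # log-likelihoods
    "gmm_mfcc_svm": {"A": 0.3, "B": 0.9, "C": -0.2},       # SVM decision values
    "gmm_gfcc_svm": {"A": 0.7, "B": 0.1, "C": 0.4},        # SVM decision values
}

# Placeholder weights; the paper derives these from a fuzzy inference system.
weights = {
    "gmm_mfcc_mlc": 0.2,
    "gmm_gfcc_mlc": 0.4,
    "gmm_mfcc_svm": 0.1,
    "gmm_gfcc_svm": 0.3,
}

fused = weighted_borda_fusion(scores, weights)
print(fused[0][0])  # identity with the highest fused Borda count
```

Because the Borda count works on ranks rather than raw scores, log-likelihoods from the GMM classifiers and decision values from the SVM classifiers can be combined without normalizing them to a common scale.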
