CASA-Based Speaker Identification Using Cascaded GMM-CNN Classifier in Noisy and Emotional Talking Conditions

This work aims to improve text-independent speaker identification performance in realistic application scenarios such as noisy and emotional talking conditions. It does so by combining two modules: a Computational Auditory Scene Analysis (CASA) pre-processing module for noise reduction, and a cascaded Gaussian Mixture Model–Convolutional Neural Network (GMM-CNN) classifier for speaker identification followed by emotion recognition. This research proposes and evaluates a novel algorithm to improve speaker identification accuracy under emotional and highly noisy conditions. Experiments demonstrate that the proposed model yields promising results compared with other classifiers when evaluated in noisy environments on the Speech Under Simulated and Actual Stress (SUSAS) database, the Emirati Speech Database (ESD), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Fluent Speech Commands database.
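The cascade idea described above, a generative GMM stage that shortlists candidate speakers, followed by a discriminative stage that makes the final decision, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses synthetic feature vectors in place of MFCC frames, scikit-learn's `GaussianMixture` for the first stage, and a logistic-regression classifier as a stand-in for the CNN second stage. All names (`frames`, `identify`, `shortlist_k`) are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic per-speaker "feature frames" (stand-ins for MFCC vectors).
n_speakers, dim, n_train = 3, 8, 200
centers = rng.normal(0, 3, size=(n_speakers, dim))

def frames(speaker, n):
    """Draw n feature frames for one synthetic speaker."""
    return centers[speaker] + rng.normal(0, 1, size=(n, dim))

train = {s: frames(s, n_train) for s in range(n_speakers)}

# Stage 1: fit one GMM per enrolled speaker.
gmms = {s: GaussianMixture(n_components=2, random_state=0).fit(x)
        for s, x in train.items()}

# Stage 2 stand-in: a discriminative model over all speakers
# (the paper uses a CNN here; logistic regression keeps the sketch small).
X = np.vstack([train[s] for s in range(n_speakers)])
y = np.repeat(np.arange(n_speakers), n_train)
second_stage = LogisticRegression(max_iter=1000).fit(X, y)

def identify(utterance, shortlist_k=2):
    """Cascade: GMM log-likelihoods shortlist speakers, stage 2 decides."""
    scores = {s: g.score(utterance) for s, g in gmms.items()}  # mean log-lik.
    shortlist = sorted(scores, key=scores.get, reverse=True)[:shortlist_k]
    probs = second_stage.predict_proba(utterance).mean(axis=0)
    return max(shortlist, key=lambda s: probs[s])
```

With well-separated synthetic speakers, `identify(frames(1, 50))` returns speaker 1; in practice the shortlist keeps the expensive second stage from scoring every enrolled speaker on every utterance.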
