Speaker identification based on Radon transform and CNNs in the presence of different types of interference for Robotic Applications

Abstract Both automatic speaker identification (ASI) and speech recognition can now be utilized for the control of modern robots. An ASI algorithm can be implemented at the speech interface of the robot to determine the identity of the person allowed to operate it, while speech recognition can be used to interpret the order given to the robot. Keeping an ASI system robust is challenging in the presence of speech degradations such as noise and interference. This study presents a new approach that uses a convolutional neural network (CNN) to improve the accuracy of speaker identification in the presence of interference for robot control applications. First, the speech signal from the speaker is divided into segments, each of which is transformed into a spectrogram, and the Radon transform of this spectrogram is then computed. The spectrogram resolves the speech segment into a map of power distribution over both time and frequency. Together, the spectrograms and their Radon transforms are used as inputs to a proposed CNN-based deep learning model. Necessary refinements are undertaken, and the resulting optimized "Radon-Deep-Learning Model" (RDLM) is compared with a benchmark model. The proposed model consists of six convolutional (CNV) layers followed by six max-pooling layers, while the benchmark model consists of three CNV layers followed by three max-pooling layers. Experimental results reveal that the proposed RDLM achieves a classification accuracy of up to 97.5%, which is more than double the performance reported for some traditional methods used for speaker identification.
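
The feature-extraction stage described in the abstract (segmentation, spectrogram, Radon transform) can be illustrated with the following minimal Python sketch. It is not the authors' implementation. The function names, the non-overlapping segmentation, the Hann window, the frame length and hop size, the set of projection angles, and the nearest-neighbour Radon approximation are all assumptions made for illustration. The CNN itself is omitted because the abstract specifies only the number of layers.

```python
import numpy as np


def split_segments(signal, seg_len):
    """Cut a 1-D signal into non-overlapping segments, dropping the tail."""
    n_segs = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]


def spectrogram(segment, frame_len=256, hop=128):
    """Power spectrogram, shape (frame_len // 2 + 1, n_frames).

    Hann-windowed short-time Fourier transform; frame_len and hop are
    illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(segment) - frame_len) // hop
    frames = np.stack([segment[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).T


def radon(image, angles_deg):
    """Approximate Radon transform, shape (n_offsets, n_angles).

    For every angle, the image is summed along parallel rays using
    nearest-neighbour sampling (a simplification chosen for brevity).
    """
    h, w = image.shape
    n = int(np.ceil(np.hypot(h, w)))           # rays and samples per ray
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0      # rotation centre
    axis = np.arange(n) - (n - 1) / 2.0
    S, T = np.meshgrid(axis, axis, indexing="ij")  # S: ray offset, T: along ray
    sinogram = np.zeros((n, len(angles_deg)))
    for k, theta in enumerate(np.deg2rad(angles_deg)):
        x = cx + S * np.cos(theta) - T * np.sin(theta)
        y = cy + S * np.sin(theta) + T * np.cos(theta)
        xi = np.floor(x + 0.5).astype(int)     # nearest pixel column
        yi = np.floor(y + 0.5).astype(int)     # nearest pixel row
        inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
        vals = np.zeros_like(x)
        vals[inside] = image[yi[inside], xi[inside]]
        sinogram[:, k] = vals.sum(axis=1)      # integrate along each ray
    return sinogram


def extract_features(signal, seg_len, angles_deg):
    """Return one (spectrogram, Radon transform) pair per segment."""
    pairs = []
    for seg in split_segments(signal, seg_len):
        spec = spectrogram(seg)
        pairs.append((spec, radon(spec, angles_deg)))
    return pairs


if __name__ == "__main__":
    fs = 8000                                  # assumed sampling rate (Hz)
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 440 * t)         # synthetic stand-in for speech
    for spec, sino in extract_features(tone, 4000, np.arange(0, 180, 10)):
        print(spec.shape, sino.shape)
```

Each pair returned by `extract_features` holds the two kinds of image that the abstract describes as network inputs. How they are combined or resized before entering the CNN is not stated in the abstract, so the sketch leaves them separate.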
