A network model of speaker identification with new feature extraction methods and asymmetric BLSTM

Abstract: Speaker identification has recently attracted considerable attention within speaker recognition. Environmental noise and short utterances pose two challenges for accurate speaker identification. This paper proposes a network model that combines new feature extraction methods with a new bi-directional long short-term memory network to identify the speaker. Specifically, the mel-spectrogram and the cochleagram are combined to generate two new features, named MC-spectrogram and MC-cube. These features are more robust and capture richer voiceprint information from short utterances. Multi-dimensional CNNs are then applied to process the MC-spectrogram and MC-cube features respectively. Their multi-dimensional convolution kernels learn the voiceprint features more efficiently. However, a CNN ignores context information, and the forward voiceprint features are more crucial because the voiceprint features concentrate on the back part of a short utterance. An asymmetric bi-directional long short-term memory network (ABLSTM) is therefore proposed to further learn the voiceprint features during global feature learning, which improves the accuracy of speaker identification. Depending on the dimension of the input, the proposed network model takes different forms, named Audio-1DCNN-ABLSTM, MCS(MC-spectrogram)-2DCNN-ABLSTM and MCC(MC-cube)-3DCNN-ABLSTM. Experimental results show that these variants achieve superior accuracy and robustness on short utterances with added environmental noise. Furthermore, the proposed network model provides a reliable solution for text-independent speaker identification.
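
The abstract does not state how the mel-spectrogram and cochleagram are combined, so the following is only a shape-level sketch under stated assumptions, not the authors' method. It assumes both inputs are time-frequency matrices over the same frames, and it illustrates why one fused feature would suit a 2-D CNN while the other would suit a 3-D CNN. The function names `combine_2d` and `combine_3d`, the concatenation along the band axis, and the two-layer stacking are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # frames x frequency bands


def shape(m: Matrix) -> Tuple[int, int]:
    """Return (frames, bands) of a rectangular matrix."""
    return (len(m), len(m[0]) if m else 0)


def combine_2d(mel: Matrix, coch: Matrix) -> Matrix:
    """Hypothetical 2-D fusion: concatenate the band axes frame by frame."""
    if len(mel) != len(coch):
        raise ValueError("both inputs must cover the same frames")
    return [m_row + c_row for m_row, c_row in zip(mel, coch)]


def combine_3d(mel: Matrix, coch: Matrix) -> List[Matrix]:
    """Hypothetical 3-D fusion: stack the two maps along a new depth axis."""
    if shape(mel) != shape(coch):
        raise ValueError("both inputs must have identical shapes")
    return [[row[:] for row in mel], [row[:] for row in coch]]


# Toy inputs: 3 frames x 4 bands. Values are placeholders, not real features.
mel = [[0.1 * (t + b) for b in range(4)] for t in range(3)]
coch = [[0.2 * (t + b) for b in range(4)] for t in range(3)]

mcs = combine_2d(mel, coch)  # 2-D map (3 frames x 8 bands) for a 2-D CNN
mcc = combine_3d(mel, coch)  # 3-D volume (2 x 3 x 4) for a 3-D CNN
```

In this sketch the 2-D result keeps a single image-like map, whereas the 3-D result keeps the two representations as separate layers, which is the kind of input a 3-D convolution kernel can slide across.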
