GMM and CNN Hybrid Method for Short Utterance Speaker Recognition

In recent years, speaker recognition has attracted wide attention because of its many applications in fields such as speech communications, domestic services, and smart terminals. The Gaussian mixture model (GMM), a key method in this area, achieves recognition performance close to that of human listeners on long utterances. However, the GMM fails to recognize speakers from short utterances with high accuracy. To address this problem, we propose a novel model that improves the accuracy of short utterance speaker recognition. Unlike traditional GMM-based models, our method trains a convolutional neural network (CNN) to process spectrograms, which can describe speakers better. As a result, the recognition system achieves considerable accuracy together with a reasonable convergence speed. Experimental results show that our model reduces the equal error rate of recognition from 4.9% to 2.5%.
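
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from verification scores; it is not the authors' evaluation code. The function name `equal_error_rate` and the example scores are hypothetical, and the sketch assumes that a higher score means the trial is more likely to come from the claimed speaker.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from two lists of verification scores.

    genuine:  scores of trials where the claimed speaker is correct
    impostor: scores of trials where the claimed speaker is wrong
    Assumption: a trial is accepted when its score >= threshold.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance rate: impostor trials scored at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest,
        # and report their average as the EER estimate.
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative (made-up) scores: one genuine trial scores below one impostor.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With only a handful of trials the two error rates rarely cross exactly, which is why the sketch averages them at the closest threshold; larger evaluations usually interpolate along the error curve instead.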
