Text-dependent and text-independent speaker recognition of reverberant speech based on CNN

Speaker recognition is an important biometric technology with numerous applications in security and telecommunications. The goal of a speaker recognition system is to identify who is speaking from the characteristics of the voice. This paper presents an extensive study of speaker recognition in both the text-dependent and text-independent cases. Convolutional Neural Network (CNN) based feature extraction is extended to both tasks. In addition, the effect of reverberation on the speaker recognition system is addressed. All speech signals are converted into images by computing their spectrograms. Two CNN models are proposed for efficient speaker recognition from clean and reverberant speech signals; both apply image processing concepts to the spectrograms of the speech signals. One of the proposed models is compared with a conventional benchmark model in the text-independent scenario. The performance of the recognition system is measured by the recognition rate for both clean and reverberant speech.
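The signal-to-image step and the evaluation metric can be illustrated with a minimal sketch. It is not the paper's implementation: the frame length, hop size, reverberation time, synthetic impulse response, and all function names (spectrogram, add_reverb, recognition_rate) are assumptions chosen for illustration, and a naive DFT from the standard library stands in for the FFT routine a real system would use.

```python
import cmath
import math
import random


def spectrogram(signal, frame_len=64, hop=32):
    """Return a log-magnitude spectrogram as a list of rows (time x frequency)."""
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT bin (assumption: an FFT would be used in practice).
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(20 * math.log10(abs(acc) + 1e-10))
        rows.append(row)
    return rows


def add_reverb(signal, fs, rt60=0.3, ir_len=400, seed=0):
    """Convolve with a synthetic impulse response (hypothetical room model)."""
    rng = random.Random(seed)
    # Envelope chosen so the amplitude falls by 60 dB after rt60 seconds.
    decay = math.log(1000.0) / (rt60 * fs)
    ir = [rng.gauss(0.0, 1.0) * math.exp(-decay * n) for n in range(ir_len)]
    ir[0] = 1.0  # direct path
    out = [0.0] * (len(signal) + ir_len - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def recognition_rate(predicted, actual):
    """Percentage of test utterances assigned to the correct speaker."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Toy stand-in for a speech signal: a 1 kHz tone sampled at 8 kHz.
fs = 8000
clean = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(512)]
reverberant = add_reverb(clean, fs)
clean_image = spectrogram(clean)          # input "image" for the clean case
reverb_image = spectrogram(reverberant)   # input "image" for the reverberant case
```

Each row of the resulting arrays is one time frame, so the two-dimensional array can be treated as a grayscale image and passed to a CNN. The recognition rate is then the percentage of test utterances whose predicted speaker label matches the true label.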
