Quality Evaluation of Reverberant Speech Based on Deep Learning

This paper presents an efficient approach for classifying speech signals as reverberant or non-reverberant. Reverberation is a severe effect encountered in closed rooms; it can degrade subsequent processing stages and deteriorate the performance of speech processing systems. Spectrograms generated from the speech signals are treated as images and classified with deep convolutional neural networks (CNNs). In addition, spectrogram and MFCC features are classified with a Long Short-Term Memory Recurrent Neural Network (LSTM RNN). The two models are presented and compared, and simulation results show classification accuracies of up to 100%. This approach can serve as an initial quality-level classification step in any speech processing system.
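A minimal sketch of the two pipelines described in the abstract is given below, assuming Python with librosa for feature extraction and Keras for the models. The helper names, layer sizes, input dimensions, and training settings are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the two classification pipelines from the abstract:
# a CNN on spectrogram "images" and an LSTM RNN on MFCC sequences.
# Layer counts, filter sizes, and hyperparameters are assumptions.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def spectrogram_image(path, sr=16000, n_mels=128, frames=128):
    """Load a clip and return a fixed-size log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    s = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    s = librosa.power_to_db(s, ref=np.max)
    s = librosa.util.fix_length(s, size=frames, axis=1)  # pad/crop in time
    return s[..., np.newaxis]  # add a channel axis for the CNN

def mfcc_sequence(path, sr=16000, n_mfcc=13, frames=128):
    """Load a clip and return a fixed-length MFCC sequence for the LSTM."""
    y, _ = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    m = librosa.util.fix_length(m, size=frames, axis=1)
    return m.T  # shape (time, features)

def build_cnn(input_shape=(128, 128, 1)):
    """CNN classifying spectrogram images as reverberant vs. clean."""
    return models.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # binary decision
    ])

def build_lstm(input_shape=(128, 13)):
    """LSTM RNN classifying MFCC (or spectrogram-frame) sequences."""
    return models.Sequential([
        layers.LSTM(64, input_shape=input_shape),
        layers.Dense(1, activation="sigmoid"),
    ])

# Either model would be trained the same way on labeled
# clean/reverberant clips, e.g.:
#   model.compile(optimizer="adam", loss="binary_crossentropy",
#                 metrics=["accuracy"])
#   model.fit(X_train, y_train, validation_split=0.1, epochs=20)
```

The design choice reflected here is the one the abstract describes: the CNN sees the spectrogram as a 2-D image, while the LSTM consumes frame-by-frame feature vectors, so the same binary reverberation label can be predicted from either representation and the two models compared directly.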
