Blind C50 estimation from single-channel speech using a convolutional neural network

The early-to-late reverberation energy ratio is an important parameter describing the acoustic properties of an environment. C50, the ratio in dB of the energy in the first 50 ms of the room impulse response to the remaining late energy, affects the perceived clarity and intelligibility of speech, and can serve as a design parameter in mixed reality applications or as a predictor of speech recognition performance. While established methods exist to derive C50 from measured impulse responses, such measurements are rarely available in practice. Recently, methods have been proposed to estimate C50 blindly from reverberant speech signals. Here, a convolutional neural network (CNN) architecture with a long short-term memory (LSTM) layer is proposed for blind C50 estimation. The CNN-LSTM operates directly on the spectrogram of variable-length, noisy, reverberant utterances. A feature comparison indicates that log Mel spectrogram features with a frame size of 128 samples achieve the best performance, with an average root-mean-square error of about 2.7 dB, outperforming previously proposed blind C50 estimators.
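
As a reference for how C50 is obtained when an impulse response measurement is available, the sketch below computes the early-to-late energy ratio directly from a room impulse response. This is a minimal illustration rather than the paper's evaluation code; the peak-based onset detection is an assumption.

```python
import numpy as np

def c50_from_rir(h, fs, onset=None):
    """Clarity index C50 in dB from a room impulse response h sampled at fs Hz."""
    h = np.asarray(h, dtype=float)
    if onset is None:
        onset = int(np.argmax(np.abs(h)))     # take the direct-sound peak as the onset
    split = onset + int(round(0.05 * fs))     # boundary 50 ms after the direct sound
    early = np.sum(h[onset:split] ** 2)       # energy of the first 50 ms
    late = np.sum(h[split:] ** 2)             # remaining late energy
    return 10.0 * np.log10(early / late)
```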

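To make the blind estimation pipeline concrete, the following sketch shows a CNN-LSTM regressor operating on a log Mel spectrogram with a frame size of 128 samples, as stated in the abstract. The layer sizes, hop length, Mel-band count, sample rate, and the use of torchaudio for the front end are assumptions for illustration and do not reproduce the proposed architecture exactly.

```python
import torch
import torch.nn as nn
import torchaudio


class CNNLSTM(nn.Module):
    """Sketch of a CNN-LSTM regressor for blind C50 estimation from speech."""

    def __init__(self, sample_rate=16000, n_mels=20, hidden=64):
        super().__init__()
        # Log Mel front end; the 128-sample frame size follows the abstract,
        # while the hop length, Mel-band count, and sample rate are assumptions.
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=128, hop_length=64, n_mels=n_mels
        )
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)     # single C50 value in dB per utterance

    def forward(self, wav):                  # wav: (batch, samples), any length
        x = torch.log(self.mel(wav) + 1e-6)  # (batch, n_mels, frames)
        x = self.cnn(x.unsqueeze(1))         # (batch, 32, n_mels // 4, frames)
        x = x.flatten(1, 2).transpose(1, 2)  # (batch, frames, features)
        _, (h, _) = self.lstm(x)             # final hidden state summarises the utterance
        return self.head(h[-1]).squeeze(-1)  # (batch,) C50 estimates in dB


# Example: C50 estimates for a batch of two 4-second reverberant utterances.
c50 = CNNLSTM()(torch.randn(2, 4 * 16000))
```

Reading out only the final LSTM hidden state yields a single utterance-level estimate regardless of input length, which matches the variable-length setting described in the abstract.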