Blind Estimation of the Speech Transmission Index for Speech Quality Prediction

The speech transmission index (STI) of a listening position within a given room indicates the quality and intelligibility of speech uttered in that room. The measure is very reliable for predicting speech intelligibility in many room conditions, but computing it requires a measurement of the room's impulse response. We present a method that uses deep convolutional neural networks to estimate the STI blindly, without measuring or modeling the impulse response of the room. Our model is trained entirely on simulated room impulse responses combined with clean speech examples from the DAPS dataset [15] and works directly on PCM audio. Our experiments show that our method predicts the true STI with an average error of under 4%. It can also distinguish between different STI conditions at a level of granularity comparable to that of human listeners.
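
For context, the non-blind route that the method avoids can be sketched as follows: given an impulse response, the modulation transfer function (MTF) is computed from the squared response, converted to an apparent signal-to-noise ratio, clipped to ±15 dB, and mapped to a 0-1 index. The code below is a minimal, simplified sketch and not the procedure used in this work. It assumes a single full-band computation (no octave-band filtering, band weighting, or masking corrections, all of which the standardized STI includes), and `synthetic_rir` is a hypothetical toy generator (exponentially decaying noise) standing in for a simulated or measured impulse response.

```python
import cmath
import math
import random

# The 14 modulation frequencies (Hz) conventionally used for the STI.
MOD_FREQS = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
             3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]


def synthetic_rir(rt60, fs=2000, seed=0):
    """Hypothetical toy impulse response: Gaussian noise whose amplitude
    decays exponentially, falling by 60 dB after rt60 seconds."""
    rng = random.Random(seed)
    n = int(1.5 * rt60 * fs)
    decay = 3.0 * math.log(10) / rt60  # exp(-decay * rt60) == 10**-3
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * i / fs) for i in range(n)]


def mtf(h, fs, f_mod):
    """Modulation transfer function at f_mod: magnitude of the Fourier
    transform of h^2, normalized by the total energy of h."""
    energy = [x * x for x in h]
    total = sum(energy)
    num = sum(e * cmath.exp(-2j * math.pi * f_mod * i / fs)
              for i, e in enumerate(energy))
    return abs(num) / total


def simplified_sti(h, fs):
    """Full-band STI approximation (assumption: no octave bands or weights)."""
    indices = []
    for f_mod in MOD_FREQS:
        m = min(max(mtf(h, fs, f_mod), 1e-6), 1.0 - 1e-6)  # avoid log(0)
        snr = 10.0 * math.log10(m / (1.0 - m))             # apparent SNR, dB
        snr = min(max(snr, -15.0), 15.0)                   # clip to +/-15 dB
        indices.append((snr + 15.0) / 30.0)                # map to [0, 1]
    return sum(indices) / len(indices)


if __name__ == "__main__":
    fs = 2000
    for rt60 in (0.3, 1.0, 2.0):
        h = synthetic_rir(rt60, fs=fs)
        print(f"RT60 = {rt60:.1f} s -> simplified STI = {simplified_sti(h, fs):.2f}")
```

Longer reverberation smears the temporal envelope of speech, lowers the MTF at higher modulation frequencies, and therefore lowers the index. A blind estimator has to predict a value of this kind from the reverberant audio alone, without access to `h`.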

[1]  T. Houtgast, et al. The Modulation Transfer Function in Room Acoustics as a Predictor of Speech Intelligibility, 1973.

[2]  Guy-Bart Stan, et al. Comparison of different impulse response measurement techniques, 2002.

[3]  Emanuel A. P. Habets, et al. Blind estimation of reverberation time based on the distribution of signal decay rates, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[5]  Bryan Pardo, et al. Predicting algorithm efficacy for adaptive multi-cue source separation, 2017, 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[6]  Jesper Jensen, et al. A short-time objective intelligibility measure for time-frequency weighted noisy speech, 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Douglas L. Jones, et al. Blind estimation of reverberation time, 2003, The Journal of the Acoustical Society of America.

[8]  John S. Bradley, et al. A just noticeable difference in C50 for speech, 1999.

[9]  Andries P. Hekstra, et al. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[10]  Jont B. Allen, et al. Image method for efficiently simulating small‐room acoustics, 1976.

[11]  T. Houtgast, et al. A physical method for measuring speech-transmission quality, 1980, The Journal of the Acoustical Society of America.

[12]  Masashi Unoki, et al. Blind method of estimating speech transmission index from reverberant speech signals, 2013, 21st European Signal Processing Conference (EUSIPCO 2013).

[13]  R. Maas, et al. A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research, 2016, EURASIP Journal on Advances in Signal Processing.

[14]  Haizhou Li, et al. Learning to estimate reverberation time in noisy and reverberant rooms, 2015, INTERSPEECH.

[15]  Gautham J. Mysore, et al. Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges, 2015, IEEE Signal Processing Letters.

[16]  Thomas Sporer, et al. PEAQ - The ITU Standard for Objective Measurement of Perceived Audio Quality, 2000.

[17]  Tiago H. Falk, et al. A Non-Intrusive Quality and Intelligibility Measure of Reverberant and Dereverberated Speech, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  M. Schroeder. Integrated‐impulse method measuring sound decay without using impulses, 1979.

[19]  Martin Vetterli, et al. FRIDA: FRI-based DOA estimation for arbitrary array layouts, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Herman J. M. Steeneken, et al. Past, present and future of the speech transmission index, 2002.