Improvement and Assessment of Spectro-Temporal Modulation Analysis for Speech Intelligibility Estimation

Several recent high-performing intelligibility estimators of acoustically degraded speech signals employ temporal modulation analysis. In this paper, we investigate the utility of using both spectro- and temporal-modulation for estimating speech intelligibility. We modified a pre-existing speech intelligibility estimation scheme (STMI) that was inspired by human auditory spectro-temporal modulation analysis. We produced several variants of the modified STMI and assessed their intelligibility prediction accuracy, in comparison with several high-performing estimators. Among the estimators tested, one of the STMI variants and eSTOI performed consistently well on both noisy and reverberated speech. These results suggest that spectro-temporal modulation analysis is useful for certain degradation conditions such as modulated noise and reverberation.
