Performance comparison of intrusive and non-intrusive instrumental quality measures for enhanced speech

Instrumental quality prediction of speech processed by enhancement algorithms has become crucial with the proliferation of far-field speech applications. Although several instrumental measures have been proposed and standardized, their performance across a wide range of acoustic conditions and enhancement algorithms is still unknown. This paper aims to fill this gap. Specifically, the performance of eleven instrumental measures is compared: four are non-intrusive, i.e., they do not require a clean reference signal, and seven are intrusive. Simulated and recorded speech under four different acoustic conditions, involving varying levels of reverberation and noise, is explored, as is speech processed by three single- and multi-channel enhancement algorithms. Experimental results show that a recently developed non-intrusive measure called SRMRnorm outperforms all other considered measures in terms of overall quality prediction. The well-known PESQ measure, in turn, was the best predictor of the perceived amount of reverberation, followed by SRMRnorm. These results are promising, as SRMRnorm does not require access to a clean reference signal and thus has the potential to be used for enhancement algorithm optimization in real time.
