Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which simultaneously estimates the quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and a bidirectional long short-term memory architecture for representation extraction, followed by a multiplicative attention layer and a fully connected layer for each assessment metric. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learning (SSL) models are used as inputs, combining rich acoustic information to obtain more accurate assessments. Experimental results show that, in both seen and unseen noise environments, MOSA-Net improves the linear correlation coefficient (LCC) in perceptual evaluation of speech quality (PESQ) prediction compared to Quality-Net, an existing single-task model for PESQ prediction, and improves the LCC in short-time objective intelligibility (STOI) prediction compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can serve as a pre-trained model that is effectively adapted, with a limited amount of training data, into an assessment model for predicting subjective quality and intelligibility scores. Experimental results show that MOSA-Net improves the LCC in mean opinion score (MOS) prediction compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE improves PESQ scores over a baseline SE model in both seen and unseen noise environments.
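
The comparisons above are reported as LCC, the Pearson linear correlation between predicted and reference scores. The following is a minimal sketch of that metric, not code from the paper; the function name `lcc` and the score lists are hypothetical and made up purely for illustration.

```python
import math


def lcc(predicted, reference):
    """Pearson linear correlation coefficient between two score sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least two scores")
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    # Covariance and variances (unnormalized; the 1/N factors cancel in the ratio)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores (assumption: invented values, not results from the paper)
reference_pesq = [1.2, 2.0, 2.9, 3.5]
predicted_pesq = [1.4, 2.2, 3.1, 3.7]  # each prediction is the reference plus 0.2

score = lcc(predicted_pesq, reference_pesq)
```

Because LCC measures linear agreement rather than absolute error, the constant offset in this toy example still yields a correlation of 1 up to floating-point rounding. This is why score predictors are often evaluated with an error measure alongside LCC.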
