Paper information - Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and a bidirectional long short-term memory architecture for representation extraction, followed by a multiplicative attention layer and a fully connected layer for predicting each assessment metric. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learning (SSL) models are used as inputs, combining rich acoustic information to obtain more accurate assessments. Experimental results show that, in both seen and unseen noise environments, MOSA-Net improves the linear correlation coefficient (LCC) scores in perceptual evaluation of speech quality (PESQ) prediction compared to Quality-Net, an existing single-task model for PESQ prediction, and improves LCC scores in short-time objective intelligibility (STOI) prediction compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can be used as a pre-trained model and effectively adapted into an assessment model that predicts subjective quality and intelligibility scores from a limited amount of training data. Experimental results show that MOSA-Net improves LCC scores in mean opinion score (MOS) prediction compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE outperforms the baseline SE system, achieving higher PESQ scores in both seen and unseen noise environments.
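The abstract reports its comparisons in terms of the linear correlation coefficient (LCC) between predicted and reference scores. The following is a minimal sketch of that metric, assuming LCC here means the standard Pearson correlation; the function name `lcc` and the example score lists are hypothetical and are not taken from the paper.

```python
import math


def lcc(predicted, reference):
    """Pearson linear correlation between two equal-length score sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)

    # Covariance and variances, left unnormalized because the 1/N factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)

    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined when one sequence is constant")

    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores for illustration only (not results from the paper).
reference_pesq = [1.2, 2.5, 3.1, 3.8, 4.2]
predicted_pesq = [1.5, 2.4, 2.9, 3.9, 4.0]
print(f"LCC = {lcc(predicted_pesq, reference_pesq):.3f}")
```

An LCC close to 1 means the predicted scores track the reference scores almost linearly. In a multi-objective setting such as the one described above, the same function would be applied separately to each predicted metric (for example PESQ, STOI, or MOS).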
