Paper information - More for Less: Non-Intrusive Speech Quality Assessment with Limited Annotations

Non-intrusive speech quality assessment is a crucial operation in multimedia applications. The scarcity of annotated data and the lack of a reference signal are among the main challenges in designing efficient quality assessment metrics. In this paper, we propose two multi-task models to tackle these problems. In the first model, we learn a feature representation with a degradation classifier on a large dataset, then perform MOS prediction and degradation classification simultaneously on a small dataset annotated with MOS. In the second model, we learn features on the large dataset with a deep clustering-based unsupervised feature representation, then perform MOS prediction and cluster label classification simultaneously on a small dataset. The results show that the deep clustering-based model outperforms the degradation classifier-based model and the three baselines (autoencoder features, P.563, and SRMRnorm) on TCD-VoIP. These results indicate that multi-task learning combined with feature representations from unlabelled data is a promising approach to deal with the lack of large MOS-annotated datasets.
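The joint objective described in the abstract (MOS prediction trained together with a degradation or cluster-label classifier) can be illustrated with a minimal sketch. The abstract does not state the loss functions or how they are weighted, so everything here is an assumption made for illustration: mean squared error for the MOS term, softmax cross-entropy for the classification term, and a hypothetical `aux_weight` parameter that balances the two.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and annotated MOS values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; labels are integer class indices."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[label]
    return total / len(labels)


def multitask_loss(mos_pred, mos_true, class_logits, class_labels, aux_weight=1.0):
    """Assumed objective: MOS regression loss plus a weighted auxiliary
    classification loss (degradation type or cluster label)."""
    return mse(mos_pred, mos_true) + aux_weight * cross_entropy(class_logits, class_labels)


if __name__ == "__main__":
    # Toy batch of two utterances with two auxiliary classes (made-up values).
    loss = multitask_loss(
        mos_pred=[3.0, 4.0],
        mos_true=[3.5, 4.0],
        class_logits=[[2.0, 0.0], [0.0, 2.0]],
        class_labels=[0, 1],
        aux_weight=0.5,
    )
    print(loss)
```

Under this sketch, the second model would differ only in where `class_labels` come from: cluster assignments obtained from unlabelled data rather than known degradation types.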
