More for Less: Non-Intrusive Speech Quality Assessment with Limited Annotations

Non-intrusive speech quality assessment is a crucial operation in multimedia applications. Two of the main challenges in designing efficient quality metrics are the scarcity of annotated data and the absence of a reference signal. In this paper, we propose two multi-task models that address these problems. In the first model, we learn a feature representation by training a degradation classifier on a large dataset, and then perform MOS prediction and degradation classification simultaneously on a small dataset annotated with MOS. In the second model, we learn features on the large dataset with a deep clustering-based unsupervised feature representation, and then perform MOS prediction and cluster label classification simultaneously on a small dataset. The results show that the deep clustering-based model outperforms both the degradation classifier-based model and the three baselines (autoencoder features, P.563, and SRMRnorm) on TCD-VoIP. These findings indicate that multi-task learning combined with feature representations learned from unlabelled data is a promising way to cope with the lack of large MOS-annotated datasets.
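To illustrate what performing MOS prediction and label classification "simultaneously" can mean in practice, the following is a minimal sketch of a joint training objective. It is not the authors' implementation: the choice of mean squared error for the MOS branch, softmax cross-entropy for the classification branch, the weighting parameter `alpha`, and all function names are assumptions made for illustration only.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and annotated MOS values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits_batch, labels):
    """Mean softmax cross-entropy; labels are integer class indices
    (degradation classes or cluster assignments)."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[label]
    return total / len(labels)


def multitask_loss(mos_pred, mos_true, class_logits, class_labels, alpha=0.5):
    """Weighted sum of the regression and classification losses.
    alpha is a hypothetical trade-off weight, not a value from the paper."""
    reg = mse(mos_pred, mos_true)
    cls = cross_entropy(class_logits, class_labels)
    return alpha * reg + (1.0 - alpha) * cls


if __name__ == "__main__":
    loss = multitask_loss(
        mos_pred=[3.0, 4.0],
        mos_true=[3.0, 2.0],
        class_logits=[[0.0, 0.0], [0.0, 0.0]],
        class_labels=[0, 1],
    )
    print(f"joint loss: {loss:.4f}")
```

In a setup like this, both branches would share the feature representation learned in the first stage, so the classification task acts as a regulariser for the MOS regressor when MOS labels are scarce.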
