Full-Reference Speech Quality Estimation with Attentional Siamese Neural Networks

In this paper, we present a full-reference speech quality prediction model based on deep learning. A Siamese recurrent convolutional network, which shares its weights between the two inputs, computes a feature representation of the reference signal and of the degraded signal. An attention mechanism then uses these features to align the two signals, and the aligned features are finally combined to estimate the overall speech quality. The proposed architecture offers a simple solution to the time-alignment problem that arises for speech signals transmitted through Voice-over-IP networks, and it shows how the clean reference signal can be incorporated into speech quality models based on end-to-end trained neural networks.
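The three stages named in the abstract (shared-weight feature extraction, attention-based alignment, and feature combination) can be illustrated with a minimal sketch. This is not the architecture from the paper. The function names `encode`, `attend` and `distance_score` are hypothetical, a single tanh layer stands in for the recurrent convolutional network, the attention scores use a negative squared distance chosen only for illustration, and the final value is a plain feature distance rather than a trained quality estimate.

```python
import math

def encode(frames, weights):
    """Siamese encoder stand-in: the SAME weights process either signal.
    (Assumption: one tanh layer replaces the recurrent convolutional network.)"""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]
            for frame in frames]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(degraded_feats, reference_feats, sharpness=10.0):
    """Soft alignment: each degraded frame attends over all reference frames.
    (Assumption: scores are negative squared distances scaled by `sharpness`.)"""
    aligned, attention = [], []
    for d in degraded_feats:
        scores = [-sharpness * sum((a - b) ** 2 for a, b in zip(d, r))
                  for r in reference_feats]
        w = softmax(scores)
        # Attention-weighted average of the reference features.
        aligned.append([sum(wi * r[k] for wi, r in zip(w, reference_feats))
                        for k in range(len(d))])
        attention.append(w)
    return aligned, attention

def distance_score(degraded_feats, aligned_feats):
    """Combine both feature sequences into one number: the mean absolute
    difference (hypothetical stand-in for the trained quality estimator)."""
    total = sum(abs(a - b)
                for d, r in zip(degraded_feats, aligned_feats)
                for a, b in zip(d, r))
    return total / (len(degraded_feats) * len(degraded_feats[0]))

# Toy data: five reference frames; the degraded signal is an undistorted
# excerpt that starts two frames late, mimicking a transmission delay.
shared_weights = [[1.0, 0.0], [0.0, 1.0]]
reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
degraded = reference[2:]

ref_feats = encode(reference, shared_weights)
deg_feats = encode(degraded, shared_weights)
aligned, attention = attend(deg_feats, ref_feats)

# Index of the reference frame that each degraded frame attends to most.
best = [row.index(max(row)) for row in attention]
score = distance_score(deg_feats, aligned)
print(best, score)
```

In this toy case each degraded frame attends most strongly to the reference frame it was copied from, so the delay is absorbed by the attention weights and the remaining feature distance is close to zero. In an end-to-end trained model, the encoder, the attention scoring and the final mapping to a quality value would all be learned from rated speech samples.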
