Paper Information - InSE-NET: A Perceptually Coded Audio Quality Model based on CNN

Automatic coded audio quality assessment is an important task whose progress is hampered by the scarcity of human annotations, poor generalization to unseen codecs, bitrates, and content types, and the lack of flexibility of existing approaches. One typical perception-related metric, ViSQOL v3 (ViV3), has been shown to correlate highly with quality scores rated by humans. In this study, we take steps toward predicting coded audio quality by relying entirely on programmatically generated data informed by expert domain knowledge. We propose a learnable neural network, InSE-NET, with a backbone of Inception and Squeeze-and-Excitation modules, to assess the perceived quality of coded audio at a 48 kHz sample rate. We demonstrate that synthetic data augmentation can improve the prediction. Our proposed method is intrusive, i.e., it requires Gammatone spectrograms of the unencoded reference signals. Besides performing comparably to ViV3, our approach provides more robust predictions toward higher bitrates.
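
The abstract names Squeeze-and-Excitation (SE) modules as part of the backbone but gives no layer details. The following is a minimal NumPy sketch of the generic SE operation only (squeeze by global average pooling, excite through a small bottleneck, then rescale the channels). It is not the InSE-NET configuration: the function name `squeeze_excite`, the channel-first `(C, H, W)` layout, the channel count, and the reduction ratio of 4 are all illustrative assumptions.

```python
import numpy as np


def squeeze_excite(x, w1, b1, w2, b2):
    """Generic Squeeze-and-Excitation gating (illustrative, not InSE-NET's exact layer).

    x  : feature map of shape (C, H, W), e.g. channels x frequency x time
    w1 : bottleneck weights of shape (C // r, C), b1 of shape (C // r,)
    w2 : expansion weights of shape (C, C // r), b2 of shape (C,)
    """
    # Squeeze: global average pooling over the two spatial axes -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel
    h = np.maximum(w1 @ z + b1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))
    # Scale: multiply every channel by its gate value in (0, 1)
    return x * s[:, None, None]


# Hypothetical sizes: 8 channels, reduction ratio r = 4, 16 x 32 feature map.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.standard_normal((C, 16, 32))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = squeeze_excite(x, w1, b1, w2, b2)
```

The output keeps the input shape; each channel is attenuated by a learned factor, which lets a network emphasize the channels that matter most for the quality prediction. In a trained model the weights would be learned rather than drawn at random as here.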