Paper Information - Source-Aware Neural Speech Coding for Noisy Speech Compression

This paper introduces a novel neural network-based speech coding system that can handle noisy speech effectively. The proposed source-aware neural audio coding (SANAC) system harmonizes a deep autoencoder-based source separation model and a neural coding system so that it can explicitly perform source separation and coding in the latent space. An added benefit of this system is that the codec can allocate a different number of bits to each underlying source, so that the more important source sounds better in the decoded signal. We target the use case where the user on the receiver side cares about the quality of the non-speech components in speech communication, while the speech source still carries the most important information. Both objective and subjective evaluation tests show that SANAC can recover the original noisy speech with better quality than the baseline neural audio coding system, which has no source-aware coding mechanism.
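
The idea of giving each source its own bit budget can be illustrated with a minimal Python sketch. It is not the SANAC implementation: the abstract does not specify the quantizer, so a plain uniform scalar quantizer is swapped in here, and the even split of the latent vector into a "speech" half and a "noise" half, the function names, and the bit depths (5 and 2) are all hypothetical choices made for illustration.

```python
import random


def uniform_quantize(values, bits, lo=-1.0, hi=1.0):
    """Map each value to the nearest of 2**bits evenly spaced levels in [lo, hi].

    Assumption: a uniform scalar quantizer stands in for whatever learned
    quantizer an actual neural codec would use. Requires bits >= 1.
    """
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    quantized = []
    for v in values:
        clipped = min(max(v, lo), hi)
        index = round((clipped - lo) / step)
        quantized.append(lo + index * step)
    return quantized


def mse(reference, estimate):
    """Mean squared error between two equal-length sequences."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def code_latent(latent, speech_bits, noise_bits):
    """Quantize the two halves of a latent vector with different bit depths.

    Assumption: the first half of the latent represents speech and the
    second half represents the non-speech source (a hypothetical layout).
    """
    half = len(latent) // 2
    speech_hat = uniform_quantize(latent[:half], speech_bits)
    noise_hat = uniform_quantize(latent[half:], noise_bits)
    return speech_hat, noise_hat


# Toy latent vector standing in for an encoder output (random, not real audio).
rng = random.Random(0)
latent = [rng.uniform(-1.0, 1.0) for _ in range(256)]

SPEECH_BITS, NOISE_BITS = 5, 2  # hypothetical per-dimension bit budgets
speech_hat, noise_hat = code_latent(latent, SPEECH_BITS, NOISE_BITS)

speech_err = mse(latent[:128], speech_hat)
noise_err = mse(latent[128:], noise_hat)
total_bits = 128 * SPEECH_BITS + 128 * NOISE_BITS

print(f"speech MSE: {speech_err:.5f}, noise MSE: {noise_err:.5f}")
print(f"total bits for this latent vector: {total_bits}")
```

In this toy setting the speech half is reconstructed with lower error than the noise half because it receives more bits per dimension, while the total budget is smaller than it would be if both halves used the higher bit depth.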
