Audio Codec Enhancement with Generative Adversarial Networks

Audio codecs are typically transform-domain based and code stationary audio signals efficiently, but they struggle with speech and with signals containing dense transient events such as applause. Using these two classes of signals as examples, we demonstrate a technique, based on generative adversarial networks (GANs), for restoring audio degraded by coding noise. A primary advantage of the proposed GAN-based coded audio enhancer is that it operates end-to-end directly on decoded audio samples, eliminating the need to design a manually crafted front end. Furthermore, the enhancement approach described in this paper can improve the sound quality of low-bitrate coded audio without any modification to existing standard-compliant encoders. Subjective tests show that the proposed enhancer significantly improves the quality of speech and of difficult-to-code applause excerpts.
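
The end-to-end idea can be illustrated with a toy sketch: a generator maps decoded waveform samples directly to enhanced samples, and a conditional discriminator scores (candidate, decoded) pairs. Everything below is an assumption made for illustration and is not taken from the paper: the single-convolution residual generator, the linear discriminator, the least-squares form of the adversarial loss, and all function and variable names are hypothetical.

```python
import numpy as np

def generator(decoded, kernel):
    """Toy enhancer (hypothetical): decoded waveform plus a residual
    correction computed by one 1-D convolution on the raw samples."""
    return decoded + np.convolve(decoded, kernel, mode="same")

def discriminator(candidate, decoded, w):
    """Toy conditional critic (hypothetical): a linear score over the
    concatenated (candidate, decoded) pair."""
    return float(np.concatenate([candidate, decoded]) @ w)

def lsgan_losses(d_real, d_fake):
    """Least-squares adversarial losses (an assumed choice of objective).
    The critic pushes real pairs toward 1 and enhanced pairs toward 0;
    the generator pushes enhanced pairs toward 1."""
    d_loss = 0.5 * ((d_real - 1.0) ** 2 + d_fake ** 2)
    g_loss = 0.5 * (d_fake - 1.0) ** 2
    return d_loss, g_loss

rng = np.random.default_rng(0)
n = 1024
clean = rng.standard_normal(n)                   # stand-in for the original audio
decoded = clean + 0.1 * rng.standard_normal(n)   # stand-in for coding noise
kernel = np.zeros(9)                             # zero residual: identity at start
w = rng.standard_normal(2 * n) / np.sqrt(2 * n)  # untrained critic weights

# Waveform in, waveform out: no hand-designed spectral front end.
enhanced = generator(decoded, kernel)
d_loss, g_loss = lsgan_losses(
    discriminator(clean, decoded, w),
    discriminator(enhanced, decoded, w),
)
```

In a real system both networks would be deep and trained by alternating gradient steps on the two losses. Only the decoder output is processed, so the encoder and the bitstream stay untouched.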
