Convolutional Neural Networks to Enhance Coded Speech

Enhancing coded speech that suffers from far-end acoustic background noise, quantization noise, and potentially transmission errors is a challenging task. In this paper, we propose two postprocessing approaches that apply convolutional neural networks in either the time domain or the cepstral domain to enhance the coded speech without any modification of the codecs. The time-domain approach works end to end, whereas the cepstral domain approach uses analysis–synthesis with cepstral domain features. The proposed postprocessors in both domains are evaluated for various narrowband and wideband speech codecs in a wide range of conditions. The proposed postprocessor improves the perceptual evaluation of speech quality (PESQ) score by up to 0.25 mean opinion score listening quality objective (MOS-LQO) points for G.711, 0.30 points for G.726, 0.82 points for G.722, and 0.26 points for the adaptive multirate wideband (AMR-WB) codec. In a subjective comparison category rating (CCR) listening test, the proposed postprocessor on G.711-coded speech exceeds the speech quality of an ITU-T-standardized postfilter by 0.36 CMOS points and is clearly preferred over legacy G.711 by 1.77 CMOS points; it is even rated better than uncoded speech, with statistical significance. The source code for the cepstral domain approach to enhance G.711-coded speech is available at https://github.com/ifnspaml/Enhancement-Coded-Speech.
