Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration

Perceptual audio coding is widely and successfully used for audio compression. However, perceptual audio coders can introduce audible coding artifacts when encoding audio at low bitrates. Low-bitrate audio restoration is a challenging problem that aims to recover, from a low-quality encoded version, a high-quality audio signal close to the uncompressed original. In this paper, we propose a novel data-driven method for audio restoration in which temporal and spectral dynamics are explicitly captured by a deep time-frequency LSTM (TF-LSTM) recurrent neural network. Leveraging the captured temporal and spectral information facilitates learning a nonlinear mapping from the magnitude spectrogram of low-quality audio to that of high-quality audio. The proposed method substantially attenuates audible artifacts caused by codecs and is conceptually straightforward. Extensive experiments show that, for low-bitrate audio at 96 kbps (mono), 64 kbps (mono), and 96 kbps (stereo), the proposed method efficiently generates improved-quality audio that is competitive with, or even superior to, the audio produced by other state-of-the-art deep neural network methods and the LAME MP3 codec in perceptual quality.
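The core idea of the mapping described above can be sketched as follows. This is a minimal illustrative PyTorch model, not the paper's exact architecture: it assumes a frequency-direction LSTM that scans chunks of magnitude bins within each frame, followed by a time-direction LSTM over frames, and a linear projection back to the magnitude spectrogram. All layer sizes, the chunking scheme, and the class name `TFLSTMRestorer` are hypothetical.

```python
import torch
import torch.nn as nn

class TFLSTMRestorer(nn.Module):
    """Illustrative time-frequency LSTM mapping a low-quality magnitude
    spectrogram (frames x freq_bins) to an enhanced one."""

    def __init__(self, n_bins=256, chunk=32, hidden=128):
        super().__init__()
        assert n_bins % chunk == 0, "bins must split evenly into chunks"
        self.chunk = chunk
        # Frequency LSTM: scans along the frequency axis within each frame,
        # reading the spectrum as a sequence of magnitude-bin chunks.
        self.f_lstm = nn.LSTM(input_size=chunk, hidden_size=hidden,
                              batch_first=True)
        # Time LSTM: scans along the frame axis over the per-frame
        # frequency summaries produced above.
        self.t_lstm = nn.LSTM(input_size=hidden, hidden_size=hidden,
                              batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, mag):  # mag: (batch, frames, n_bins)
        b, t, f = mag.shape
        # Treat every frame as a frequency sequence of chunks.
        chunks = mag.reshape(b * t, f // self.chunk, self.chunk)
        _, (h_f, _) = self.f_lstm(chunks)        # final hidden state per frame
        frame_feat = h_f[-1].reshape(b, t, -1)   # (batch, frames, hidden)
        # Model temporal dynamics across frames, then project back to bins.
        seq, _ = self.t_lstm(frame_feat)
        return torch.relu(self.out(seq))         # non-negative magnitudes
```

In a restoration pipeline of this kind, the model would be trained with a regression loss (e.g. MSE) between predicted and clean magnitude spectrograms, with the enhanced magnitudes recombined with phase (e.g. via Griffin-Lim-style signal estimation from the modified STFT) to resynthesize audio.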
