Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration

Perceptual audio coding is widely and successfully used for audio compression. However, perceptual audio coders can introduce audible coding artifacts when encoding audio at low bitrates. Low-bitrate audio restoration is a challenging problem that aims to recover, from a low-quality encoded version, a high-quality audio signal close to the uncompressed original. In this paper, we propose a novel data-driven method for audio restoration in which temporal and spectral dynamics are explicitly captured by a deep time-frequency LSTM (TF-LSTM) recurrent neural network. Leveraging the captured temporal and spectral information facilitates learning a nonlinear mapping from the magnitude spectrogram of low-quality audio to that of high-quality audio. The proposed method substantially attenuates audible artifacts caused by codecs and is conceptually straightforward. Extensive experiments show that, for low-bitrate audio at 96 kbps (mono), 64 kbps (mono), and 96 kbps (stereo), the proposed method efficiently generates improved-quality audio that is competitive with, or even superior to, the audio produced by other state-of-the-art deep neural network methods and the LAME MP3 codec in perceptual quality.
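The core idea of the mapping described above can be sketched as follows. This is a minimal illustrative PyTorch model, not the paper's exact architecture: it assumes a frequency-direction LSTM that scans chunks of magnitude bins within each frame, followed by a time-direction LSTM over frames, and a linear projection back to the magnitude spectrogram. All layer sizes, the chunking scheme, and the class name `TFLSTMRestorer` are hypothetical.

```python
import torch
import torch.nn as nn

class TFLSTMRestorer(nn.Module):
    """Illustrative time-frequency LSTM mapping a low-quality magnitude
    spectrogram (frames x freq_bins) to an enhanced one."""

    def __init__(self, n_bins=256, chunk=32, hidden=128):
        super().__init__()
        assert n_bins % chunk == 0, "bins must split evenly into chunks"
        self.chunk = chunk
        # Frequency LSTM: scans along the frequency axis within each frame,
        # reading the spectrum as a sequence of magnitude-bin chunks.
        self.f_lstm = nn.LSTM(input_size=chunk, hidden_size=hidden,
                              batch_first=True)
        # Time LSTM: scans along the frame axis over the per-frame
        # frequency summaries produced above.
        self.t_lstm = nn.LSTM(input_size=hidden, hidden_size=hidden,
                              batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, mag):  # mag: (batch, frames, n_bins)
        b, t, f = mag.shape
        # Treat every frame as a frequency sequence of chunks.
        chunks = mag.reshape(b * t, f // self.chunk, self.chunk)
        _, (h_f, _) = self.f_lstm(chunks)        # final hidden state per frame
        frame_feat = h_f[-1].reshape(b, t, -1)   # (batch, frames, hidden)
        # Model temporal dynamics across frames, then project back to bins.
        seq, _ = self.t_lstm(frame_feat)
        return torch.relu(self.out(seq))         # non-negative magnitudes
```

In a restoration pipeline of this kind, the model would be trained with a regression loss (e.g. MSE) between predicted and clean magnitude spectrograms, with the enhanced magnitudes recombined with phase (e.g. via Griffin-Lim-style signal estimation from the modified STFT) to resynthesize audio.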
