Integrating Dilated Convolution into DenseLSTM for Audio Source Separation

In this paper, we propose a multi-scale, multi-band dilated time-frequency densely connected convolutional network (DenseNet) with long short-term memory (LSTM) for audio source separation. Because the spectrogram of an acoustic signal can be treated both as an image and as time-series data, it is well suited to a convolutional recurrent neural network (CRNN) architecture. We improve source separation performance by adding a dilated block, built around dilated convolutions, to the CRNN architecture. The dilated block effectively enlarges the receptive field over the spectrogram. It is also designed to reflect the acoustic characteristic that the frequency and time axes of a spectrogram vary under independent influences, such as pitch and speaking rate, respectively. In the speech enhancement experiments, we estimated the speech signal from mixtures of music, noise, and speech using various deep learning architectures, and we conducted a subjective evaluation of the estimated speech; speech quality, intelligibility, separation, and speech recognition performance were also measured. In the music separation experiments, we estimated the music signal from mixtures of music and speech using several deep learning architectures, and then measured separation performance and music identification accuracy on the estimated signal. Overall, the proposed architecture outperforms the other deep learning architectures in both the speech and the music experiments.
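
To make the core idea concrete, the following is a minimal PyTorch sketch of a dilated time-frequency dense block. The layer sizes, growth rate, and kernel shapes are hypothetical (the abstract does not specify them); the sketch only illustrates the two mechanisms named above: dense connectivity via concatenation, and dilation applied separately along the time and frequency axes so the receptive field grows independently on each axis.

```python
import torch
import torch.nn as nn

class DilatedTFDenseBlock(nn.Module):
    """Sketch of a dilated time-frequency dense block (hypothetical sizes).

    Each layer applies two parallel dilated convolutions, one dilated along
    time and one along frequency, so the receptive field is enlarged
    independently on each axis of the spectrogram.
    """

    def __init__(self, in_channels: int, growth: int = 12, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for i in range(n_layers):
            dilation = 2 ** i  # 1, 2, 4, ...: receptive field doubles per layer
            self.layers.append(nn.ModuleDict({
                # dilate along the time axis (frame dimension) only
                "time": nn.Conv2d(channels, growth, kernel_size=(1, 3),
                                  dilation=(1, dilation),
                                  padding=(0, dilation)),
                # dilate along the frequency axis (bin dimension) only
                "freq": nn.Conv2d(channels, growth, kernel_size=(3, 1),
                                  dilation=(dilation, 1),
                                  padding=(dilation, 0)),
            }))
            channels += 2 * growth  # dense connectivity: inputs accumulate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames) magnitude spectrogram
        for layer in self.layers:
            t = torch.relu(layer["time"](x))
            f = torch.relu(layer["freq"](x))
            x = torch.cat([x, t, f], dim=1)  # dense skip connections
        return x

# Example: a spectrogram of 513 frequency bins by 126 frames
# (e.g. about 2 s of 16 kHz audio with a 1024-point FFT)
block = DilatedTFDenseBlock(in_channels=1)
out = block(torch.randn(1, 1, 513, 126))
print(out.shape)  # torch.Size([1, 73, 513, 126]) with growth=12, 3 layers
```

Splitting the dilation per axis is one plausible way to realize the paper's stated design goal: because speaking rate perturbs the time axis while pitch perturbs the frequency axis, the two axes are given independent receptive-field growth rather than a single shared square kernel.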
