Multichannel Audio Source Separation With Deep Neural Networks

This article addresses the problem of multichannel audio source separation. We propose a framework in which deep neural networks (DNNs) are used to model the source spectra and are combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. We present an extensive experimental study to show the impact of different design choices on the performance of the proposed technique. We consider different cost functions for the training of DNNs, namely the probabilistically motivated Itakura-Saito divergence, as well as the Kullback-Leibler and Cauchy divergences, the mean squared error, and phase-sensitive cost functions. We also study the number of EM iterations and the use of multiple DNNs, where each DNN aims to improve the spectra estimated by the preceding EM iteration. Finally, we present the application of the proposed framework to a speech enhancement problem. The experimental results show the benefit of the proposed multichannel approach over a single-channel DNN-based approach and over the conventional multichannel iterative EM algorithm based on nonnegative matrix factorization.
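To make the signal model concrete, below is a minimal sketch of the multichannel Wiener filter at the core of such a framework, under the local Gaussian model in which each source image at time-frequency bin (f, n) has covariance v_j(f, n) R_j(f). All function and variable names, array shapes, and the use of a pseudo-inverse are illustrative assumptions, not the paper's code: v would hold the DNN-estimated source spectra and R the spatial covariance matrices re-estimated at each EM iteration.

```python
import numpy as np

def multichannel_wiener_filter(x, v, R):
    """Estimate source images from a multichannel mixture.

    x : (F, N, I) complex STFT of the I-channel mixture
    v : (J, F, N) nonnegative source spectra (e.g., DNN outputs)
    R : (J, F, I, I) spatial covariance matrix of each source

    Returns (J, F, N, I) complex source image estimates.
    """
    J, F, N = v.shape
    I = x.shape[-1]
    c_hat = np.zeros((J, F, N, I), dtype=complex)
    for f in range(F):
        for n in range(N):
            # Mixture covariance: sum_j v_j(f, n) * R_j(f)
            Rx = sum(v[j, f, n] * R[j, f] for j in range(J))
            Rx_inv = np.linalg.pinv(Rx)  # pseudo-inverse for numerical robustness
            for j in range(J):
                # Wiener gain of source j, applied to the mixture STFT frame
                W = v[j, f, n] * R[j, f] @ Rx_inv
                c_hat[j, f, n] = W @ x[f, n]
    return c_hat
```

In the full method, this filtering would be the final step after the EM iterations: the E-step computes the sources' posterior statistics, the M-step re-estimates R and v from them, and the updated spectra can be fed to a further DNN before the next iteration.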
