Representation models in single-channel source separation

Model-based single-channel source separation (SCSS) is an ill-posed problem that requires source-specific prior knowledge. In this paper, we use representation learning and compare general stochastic networks (GSNs), Gauss-Bernoulli restricted Boltzmann machines (GBRBMs), conditional Gauss-Bernoulli restricted Boltzmann machines (CGBRBMs), and higher-order contractive autoencoders (HCAEs) for modeling this source-specific knowledge. In particular, these models learn a mapping from spectrogram representations of the speech mixture to single-source spectrogram representations, i.e., we apply them as a filter for the speech mixture. At test time, each model infers the spectrograms of the individual sources, from which the soft mask for re-synthesizing the time signals is determined. We evaluate the deep architectures on data from the 2nd CHiME speech separation challenge and report results for four tasks: speaker-dependent, speaker-independent, matched noise condition, and unmatched noise condition. In all four tasks, GSNs achieve the best PESQ and overall perceptual score on average.
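To make the soft-mask re-synthesis step concrete, here is a minimal sketch. It assumes the trained model (whichever of the four architectures) has already produced magnitude spectrogram estimates for the two sources; the function name, STFT parameters, and variable names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def softmask_resynthesis(mixture, s1_mag, s2_mag, fs=16000,
                         nperseg=512, noverlap=384, eps=1e-8):
    """Re-synthesize two sources from a mixture via soft masking.

    mixture        : 1-D time-domain mixture signal
    s1_mag, s2_mag : estimated magnitude spectrograms (freq x frames)
                     of the two sources, e.g. inferred by a GSN,
                     (C)GBRBM, or HCAE (hypothetical model outputs)
    """
    # Complex STFT of the mixture; its phase is reused for re-synthesis.
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg, noverlap=noverlap)

    # Soft mask: relative magnitude of source 1 per time-frequency bin.
    mask = s1_mag / (s1_mag + s2_mag + eps)

    # Filter the mixture with the mask (and its complement) and invert.
    _, s1_hat = istft(mask * X, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, s2_hat = istft((1.0 - mask) * X, fs=fs,
                      nperseg=nperseg, noverlap=noverlap)
    return s1_hat, s2_hat
```

The mask values lie in [0, 1] and the two masks sum to one in every bin, so the filtered estimates partition the mixture's energy between the sources while keeping the mixture phase.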
