Representation Learning for Single-Channel Source Separation and Bandwidth Extension

In this paper, we use deep representation learning for model-based single-channel source separation (SCSS) and artificial bandwidth extension (ABE). Both tasks are ill-posed, so source-specific prior knowledge is required. In addition to well-known generative models such as restricted Boltzmann machines and higher-order contractive autoencoders, two recently introduced deep models, namely generative stochastic networks (GSNs) and sum-product networks (SPNs), are used to learn spectrogram representations. For SCSS, we evaluate the deep architectures on data of the 2nd CHiME speech separation challenge and provide results for a speaker-dependent, a speaker-independent, a matched-noise, and an unmatched-noise task. GSNs obtain the best PESQ and overall perceptual score on average across all four tasks. Similarly, frame-wise GSNs reconstruct the missing frequency bands in ABE best, measured in terms of frequency-domain segmental SNR, and significantly outperform SPNs embedded in hidden Markov models as well as the other representation models.
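To make the ABE evaluation metric concrete, the sketch below shows one common way to compute frequency-domain segmental SNR from magnitude spectrograms. It is a minimal illustration only, not the paper's code: the function name fd_segmental_snr, the (frames, bins) array layout, and the per-frame clipping range of -10 to 35 dB are my assumptions.

    import numpy as np

    def fd_segmental_snr(ref_mag, est_mag, clip_range=(-10.0, 35.0), eps=1e-12):
        """Frequency-domain segmental SNR between two magnitude spectrograms.

        ref_mag, est_mag: arrays of shape (n_frames, n_bins) holding the
        reference and reconstructed magnitude spectra. The SNR is computed
        per frame over the frequency bins, clipped to clip_range (an assumed
        convention; the paper's exact limits may differ), and averaged.
        """
        signal_energy = np.sum(ref_mag ** 2, axis=1)
        error_energy = np.sum((ref_mag - est_mag) ** 2, axis=1)
        snr_per_frame = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
        snr_per_frame = np.clip(snr_per_frame, *clip_range)
        return float(np.mean(snr_per_frame))

    # Example usage with synthetic spectrograms of shape (frames, bins).
    ref = np.abs(np.random.randn(100, 257))
    est = ref + 0.1 * np.random.randn(100, 257)
    print(fd_segmental_snr(ref, est))

Averaging per-frame (rather than global) SNR gives low-energy frames the same weight as high-energy ones, which is the usual motivation for the segmental variant of the measure.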
