Large-margin conditional random fields for single-microphone speech separation

Conditional random field (CRF) formulations for single-microphone speech separation are improved by large-margin parameter estimation. Speech sources are represented by acoustic state sequences from speaker-dependent acoustic models. The large-margin technique improves the classification accuracy of acoustic states by reducing generalization error during training. Non-linear mappings inspired by the mixture-maximization (MIXMAX) model are applied to speech mixture observations. Compared with a factorial hidden Markov model baseline, the improved CRF formulations achieve better separation performance with significantly less training data. Separation performance is evaluated in terms of objective speech quality measures and speech recognition accuracy on the reconstructed sources. Compared with CRF formulations without large-margin parameter estimation, the improved formulations achieve better performance without modifying the statistical inference procedures, especially when the sources are modeled with an increased number of acoustic states.
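The two key ingredients of the abstract can be sketched in a few lines. The MIXMAX (log-max) model approximates the log-power spectrum of a two-source mixture by the element-wise maximum of the sources' log-power spectra, and large-margin estimation penalizes any competing acoustic state whose score comes within a margin of the correct state's score. This is an illustrative sketch only: the function names, the multiclass hinge form, and the toy data are assumptions, not the paper's exact formulation.

```python
import numpy as np

def mixmax_approx(log_spec_a, log_spec_b):
    """MIXMAX (log-max) approximation: the log-power spectrum of a
    two-source mixture is taken as the element-wise maximum of the
    sources' log-power spectra (per frequency bin)."""
    return np.maximum(log_spec_a, log_spec_b)

def multiclass_hinge(scores, correct, margin=1.0):
    """Large-margin criterion for acoustic-state classification:
    the correct state's score must exceed every competitor's score
    by at least `margin`; shortfalls accumulate as hinge loss."""
    violations = scores - scores[correct] + margin
    violations[correct] = 0.0  # no penalty against itself
    return np.maximum(violations, 0.0).sum()

# Toy example: two 4-bin log spectra and a 3-state score vector.
a = np.array([-1.0, 2.0, 0.5, -3.0])
b = np.array([0.0, 1.0, 1.5, -4.0])
mix = mixmax_approx(a, b)          # element-wise max per bin

loss = multiclass_hinge(np.array([2.0, 1.0, 0.0]), correct=0)
```

In the `mix` example the per-bin maxima are `[0.0, 2.0, 1.5, -3.0]`; in the hinge example the correct state leads every competitor by at least the margin, so the loss is zero.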
