Variational conditional random fields for online speaker detection and tracking

Much of the literature on speaker tracking addresses one specific aspect of the problem. This paper focuses on speaker modeling and proposes conditional random fields (CRFs) for this purpose. CRFs are a class of undirected graphical models for classifying sequential data, and their properties make them attractive for speaker modeling and tracking. The main difficulty with CRFs is training: known training approaches are prone to overfitting and unreliable convergence. To address this problem, this paper proposes variational approaches; its main novelty is the adaptation of the variational framework to CRF training. The resulting approach is evaluated on three tasks. First, the best CRF configuration for speaker modeling is selected using text-independent speaker verification. Next, the selected model is used in a speaker detection task, in which the models of the speakers in the conversation are known a priori. Finally, the proposed CRF approach is compared with Gaussian mixture models (GMMs) in an online speaker tracking framework. The results show that the proposed CRF model outperforms the GMM in speaker detection and tracking, owing to its capability for sequence modeling and segmentation.
