A Sticky HDP-HMM With Application to Speaker Diarization

We consider speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566--1581]. Although the basic HDP-HMM tends to over-segment the audio data (creating redundant states and rapidly switching among them), we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
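The abstract does not spell out how the augmentation controls the switching rate. As a rough illustration only, the sketch below assumes the commonly described form of a "sticky" prior under a finite truncation: a shared weight vector `beta` drawn from a symmetric Dirichlet, and each transition row drawn from a Dirichlet whose concentration gets an extra self-transition mass `kappa`. The function names and the parameters `gamma`, `alpha`, `kappa`, and `num_states` are hypothetical choices for this sketch, not taken from the paper.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_sticky_transitions(num_states, gamma, alpha, kappa, seed=0):
    """Sample a truncated, self-transition-biased transition matrix.

    Assumed (illustrative) prior, not quoted from the abstract:
      beta ~ Dirichlet(gamma / L, ..., gamma / L)
      pi_j ~ Dirichlet(alpha * beta + kappa * e_j)
    where L = num_states is the truncation level.
    """
    rng = random.Random(seed)
    # Global state weights shared by every row of the transition matrix.
    beta = sample_dirichlet([gamma / num_states] * num_states, rng)
    rows = []
    for j in range(num_states):
        # kappa adds prior mass to the self-transition j -> j only.
        # The small floor keeps every concentration strictly positive.
        conc = [
            max(alpha * b + (kappa if k == j else 0.0), 1e-12)
            for k, b in enumerate(beta)
        ]
        rows.append(sample_dirichlet(conc, rng))
    return beta, rows


if __name__ == "__main__":
    _, sticky = sample_sticky_transitions(10, gamma=5.0, alpha=1.0, kappa=50.0)
    diag = [sticky[j][j] for j in range(10)]
    print("mean self-transition probability:", sum(diag) / len(diag))
```

Under these assumptions, a larger `kappa` concentrates each row on its diagonal entry, which discourages rapid switching between states; setting `kappa` to zero removes the bias. A finite truncation of this kind is also what makes a joint, forward-backward style resampling of the whole state sequence feasible, although that step is not shown here.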
