Nested Gibbs sampling for mixture-of-mixture model and its application to speaker clustering

This paper proposes a novel model estimation method that uses nested Gibbs sampling to estimate a mixture-of-mixture model, in which the distribution of each component is itself represented by a mixture model. This model is suitable for analyzing multilevel data, such as videos and acoustic signals, which are composed of frame-wise observations. Deterministic procedures, such as the expectation–maximization algorithm, have been employed to estimate these kinds of models, but this approach often suffers from a large bias when the amount of data is limited. To avoid this problem, we introduce a Markov chain Monte Carlo-based model estimation method. In particular, we aim to identify a suitable sampling method for mixture-of-mixture models. Gibbs sampling is a possible approach, but it can easily become trapped in local optima when each component is represented by a multi-modal distribution. Thus, we propose a novel Gibbs sampling method, called “nested Gibbs sampling,” which represents the lower-level (fine) data structure based on elemental mixture distributions and the higher-level (coarse) data structure based on mixture-of-mixture distributions. We applied this method to a speaker clustering problem and conducted experiments under various conditions. The results demonstrated that the proposed method outperformed conventional sampling-based, variational Bayesian, and hierarchical agglomerative methods.
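
To make the two-level structure concrete, the following is a minimal sketch that only draws data from a toy mixture-of-mixture model; it is not the nested Gibbs sampler itself. It assumes isotropic Gaussian elemental components with a shared scale, and the function name, parameter names, and the "speaker"/"utterance" labels are hypothetical choices for illustration.

```python
import numpy as np


def sample_mixture_of_mixtures(n_utterances, n_frames, outer_weights,
                               inner_weights, means, std, rng):
    """Draw utterances from a toy mixture-of-mixture model (illustrative only).

    outer_weights: (S,) weights over higher-level components (e.g. speakers).
    inner_weights: (S, M) weights of each component's own Gaussian mixture.
    means:         (S, M, D) Gaussian means; std is an assumed shared scale.
    """
    S, M, D = means.shape
    # Higher-level (coarse) latent variable: one component per utterance.
    speakers = rng.choice(S, size=n_utterances, p=outer_weights)
    frames = np.empty((n_utterances, n_frames, D))
    for u, s in enumerate(speakers):
        # Lower-level (fine) latent variable: one elemental Gaussian per frame.
        comps = rng.choice(M, size=n_frames, p=inner_weights[s])
        noise = std * rng.standard_normal((n_frames, D))
        frames[u] = means[s, comps] + noise
    return speakers, frames
```

An estimation method for such a model has to infer both levels of latent variables (the per-utterance component and the per-frame elemental component) from `frames` alone, which is the setting the abstract describes.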
