A Distance for HMMs Based on Aggregated Wasserstein Metric and State Registration

We propose a framework, named Aggregated Wasserstein, for computing a dissimilarity measure, or distance, between two Hidden Markov Models (HMMs) whose state-conditional distributions are Gaussian. We refer to such HMMs as Gaussian mixture model-HMMs (GMM-HMMs). For a GMM-HMM, the marginal distribution at any time point is a Gaussian mixture, a fact we exploit to softly match, i.e., register, the states of the two HMMs. The registration of states is inspired by the intrinsic relationship between optimal transport and the Wasserstein metric between distributions. Specifically, the components of the two marginal GMMs are matched by solving an optimal transport problem in which the cost between two components is the Wasserstein metric between the corresponding Gaussian distributions. The solution of this optimization problem yields a fast approximation to the Wasserstein metric between the two GMMs. The new Aggregated Wasserstein distance is a semi-metric and can be computed without generating Monte Carlo samples. It is invariant to relabeling, i.e., permutation, of the states. The distance quantifies the dissimilarity of GMM-HMMs by measuring both the difference between the two marginal GMMs and the difference between the two transition matrices. We test the new distance on retrieval and classification of time series. Experiments on both synthetic and real data demonstrate its advantages in accuracy and efficiency over existing distances based on the Kullback-Leibler divergence.
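The component-matching step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the cost between components is the squared 2-Wasserstein distance between Gaussians, which has the closed form ||m1 - m2||^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}), and it solves the transport problem as a plain linear program with SciPy. The function names (`psd_sqrt`, `gaussian_w2_squared`, `register_components`) are hypothetical. The sketch covers only the marginal GMM part of the distance; the transition-matrix term mentioned in the abstract is not included.

```python
import numpy as np
from scipy.optimize import linprog


def psd_sqrt(S):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    root_S1 = psd_sqrt(S1)
    cross = psd_sqrt(root_S1 @ S2 @ root_S1)
    value = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return max(float(value), 0.0)


def register_components(w_a, means_a, covs_a, w_b, means_b, covs_b):
    """Softly match the components of two GMMs by optimal transport.

    Returns (plan, distance): plan[i, j] is the mass moved from component i
    of the first GMM to component j of the second, and distance is the
    square root of the minimal transport cost (assumed convention).
    """
    n, m = len(w_a), len(w_b)
    cost = np.array([
        [gaussian_w2_squared(means_a[i], covs_a[i], means_b[j], covs_b[j])
         for j in range(m)]
        for i in range(n)
    ])
    # Marginal constraints on the flattened (row-major) plan:
    # row i sums to w_a[i], column j sums to w_b[j].
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.asarray(w_a, float), np.asarray(w_b, float)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    plan = res.x.reshape(n, m)
    return plan, float(np.sqrt(max(res.fun, 0.0)))


# Usage: two GMMs with the same components listed in a different order.
eye = np.eye(2)
plan, dist = register_components(
    [0.3, 0.7], [[0.0, 0.0], [5.0, 5.0]], [eye, eye],
    [0.7, 0.3], [[5.0, 5.0], [0.0, 0.0]], [eye, eye],
)
print(plan)
print(dist)
```

Because the transport plan depends only on the component weights and pairwise costs, relabeling the components of either GMM permutes the plan without changing the resulting distance, which is consistent with the permutation invariance stated above.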
