Hidden Markov Models with mixtures as emission distributions

In unsupervised classification, Hidden Markov Models (HMMs) are used to account for a neighborhood structure between observations. The emission distributions are usually assumed to belong to some parametric family. In this paper, we propose a semiparametric model in which each emission distribution is a mixture of parametric distributions, which gives greater flexibility. We show that the standard EM algorithm can be adapted to infer the model parameters. For the initialization step, we propose a hierarchical method that starts from a large number of components and combines them into the hidden states. Three likelihood-based criteria for selecting the components to be combined are discussed. To estimate the number of hidden states, BIC-like criteria are derived. A simulation study is carried out both to determine the best pairing of combining criterion and model selection criterion and to evaluate the accuracy of classification. The proposed method is also illustrated using a biological dataset from the model plant Arabidopsis thaliana. An R package, HMMmix, is freely available on CRAN.
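As an illustration of the model described above, here is a minimal sketch of how the likelihood of an HMM with mixture emissions could be evaluated. It is not the HMMmix implementation: the univariate Gaussian components, the function names (`emission_matrix`, `log_likelihood`, `bic`), and the plain Schwarz form of BIC (written so that larger is better) are all assumptions made for the example. The BIC-like criteria derived in the paper may differ.

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    """Univariate Gaussian density (assumed component family for this sketch)."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def emission_matrix(x, weights, means, sds):
    """B[t, k] = sum_j weights[k][j] * N(x[t]; means[k][j], sds[k][j]).

    Each hidden state k has its own mixture, possibly with a different
    number of components.
    """
    x = np.asarray(x, dtype=float)
    B = np.empty((x.size, len(weights)))
    for k in range(len(weights)):
        parts = [w * gaussian_pdf(x, m, s)
                 for w, m, s in zip(weights[k], means[k], sds[k])]
        B[:, k] = np.sum(parts, axis=0)
    return B

def log_likelihood(x, init, trans, weights, means, sds):
    """Observed-data log-likelihood via the scaled forward recursion."""
    B = emission_matrix(x, weights, means, sds)
    alpha = np.asarray(init, dtype=float) * B[0]
    loglik = 0.0
    for t in range(B.shape[0]):
        if t > 0:
            alpha = (alpha @ trans) * B[t]
        scale = alpha.sum()          # p(x_t | x_1, ..., x_{t-1})
        loglik += np.log(scale)
        alpha = alpha / scale        # rescale to avoid underflow
    return loglik

def bic(loglik, n_params, n_obs):
    """Standard Schwarz criterion, larger-is-better form (an assumption here)."""
    return loglik - 0.5 * n_params * np.log(n_obs)

# Two hidden states: the first emits from a two-component mixture,
# the second from a single Gaussian. All values are made up.
x = np.array([0.1, -0.3, 2.0, 3.1, 2.8])
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
weights = [[0.5, 0.5], [1.0]]
means = [[-1.0, 1.0], [3.0]]
sds = [[1.0, 1.0], [0.5]]
ll = log_likelihood(x, init, trans, weights, means, sds)
```

In the EM algorithm, the same emission matrix would feed the forward-backward recursions of the E-step. The only change from a standard HMM is that each state's density is a weighted sum of component densities.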
