Profile Hidden Markov Models Are Not Identifiable

Profile Hidden Markov Models (HMMs) are graphical models that generate finite-length sequences according to a probability distribution. They have many applications, including protein structure and function prediction, classification of novel proteins into existing protein families and superfamilies, metagenomics, and multiple sequence alignment. The standard use of profile HMMs in bioinformatics has two steps: first, a profile HMM is built for a collection of molecular sequences (which may or may not be in a multiple sequence alignment), and then the profile HMM is used in a subsequent analysis of new molecular sequences. Constructing the profile is thus itself a statistical estimation problem, since a given set of sequences might fit more than one model well. Hence a basic question about profile HMMs is whether they are statistically identifiable, meaning that no two distinct profile HMMs produce the same distribution on finite-length sequences. Statistical identifiability is a fundamental property of any statistical model, yet it is not known whether profile HMMs are statistically identifiable. In this paper, we report preliminary results towards characterizing the statistical identifiability of profile HMMs in one of the standard forms used in bioinformatics.
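
For concreteness, the identifiability condition described above can be written formally. This is a minimal sketch under assumed notation that does not appear in the original text: $\mathcal{M}$ denotes the class of profile HMMs under consideration, $\Sigma^{*}$ the set of finite-length sequences over the alphabet $\Sigma$, and $P_M$ the distribution on $\Sigma^{*}$ induced by a model $M$.

```latex
% Assumed notation (not from the source): \mathcal{M}, \Sigma^{*}, P_M.
% The model class \mathcal{M} is identifiable if equal distributions
% force equal models:
\[
  \forall\, M, M' \in \mathcal{M}:\quad
  \bigl( P_M(x) = P_{M'}(x) \ \text{ for all } x \in \Sigma^{*} \bigr)
  \;\Longrightarrow\; M = M'.
\]
% Equivalently, the map M \mapsto P_M is injective on \mathcal{M}.
```

Under this reading, exhibiting two distinct models in $\mathcal{M}$ with the same induced distribution would suffice to show that the class is not identifiable.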
