A SPECTRAL ALGORITHM FOR LEARNING HIDDEN MARKOV MODELS THAT HAVE SILENT STATES

OVERVIEW The literature on Probably Approximately Correct (PAC) learning of Hidden Markov Models (HMMs) reveals an intimate, bijective relationship between the (unknown) sequence of states in the HMM and the string of outputs generated by those states. For simplicity, I refer to each such output of a state as a "letter" from some appropriate alphabet, and to a sequence of letters as a "string". In the standard HMM model, each state emits one letter at every time step with probability one, so each letter received by the learning algorithm is known to correspond to some unknown state in the HMM. Existing PAC learning models are predicated upon this feature. Appendix A provides a brief overview of some known results for learning HMMs.

However, there are situations in which the HMM to be learned includes "silent states", which do not emit a letter. HMMs with silent states can often provide a very compact representation of the phenomena being modeled, and can greatly reduce the number of transitions among states. Such HMMs may model reality more naturally, or may be especially advantageous in environments in which computational cost depends on the number of transitions.

As one example, HMMs with silent states are employed in computational biology to model families of related genetic sequences and to determine which sequences belong to which family. Related sequences match each other along certain portions of their "strings" but not along others, due to insertions or deletions at random points (e.g., arising from replication errors, evolution, or other changes in one of the organisms generating some but not all of the sequences). When applied to this use, the HMMs are known as "profile HMMs". The states of a profile HMM have a very regular transition structure, as depicted in Figure 1 below. The discussion that follows outlines an algorithm for learning HMMs with silent states.
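To make the distinction concrete, the following toy sketch (not the paper's algorithm, and with entirely hypothetical transition and emission tables) samples from a small HMM containing one silent state. Whenever the silent state is visited, the state path grows but the observed string does not, which is exactly the loss of the one-to-one state–output correspondence discussed above.

```python
# Toy HMM with a silent state: illustrates that the observed string can be
# shorter than the hidden state path. All numbers here are made up.
import random

random.seed(0)

SILENT = {1}   # state 1 emits nothing
END = 3        # absorbing end state (no emission)
transitions = {
    0: [(1, 0.5), (2, 0.5)],   # (next_state, probability)
    1: [(2, 1.0)],             # silent state: transitions, but no letter
    2: [(0, 0.3), (3, 0.7)],
}
emissions = {0: "a", 2: "b"}   # deterministic emissions, for simplicity

def sample_path(start=0):
    """Walk the HMM from `start`; return the hidden path and observed string."""
    path, output = [start], []
    state = start
    while state != END:
        if state not in SILENT:
            output.append(emissions[state])
        r, cum = random.random(), 0.0
        for nxt, p in transitions[state]:
            cum += p
            if r < cum:
                state = nxt
                break
        path.append(state)
    return path, "".join(output)

path, string = sample_path()
# The string length equals the number of emitting states visited, which is
# strictly less than the path length whenever the silent state appears.
print(path, string)
```

A learner sees only `string`; it cannot tell from the output alone how many times the silent state was traversed, which is the core difficulty the algorithm below must address.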
Clearly, the introduction of an unknown number of silent states complicates the learning process: there is no longer a one-to-one correspondence between the outputs, which are observable, and the sequence of states transitioned through, which is never known. The algorithm presented here builds upon the algorithm and analysis developed by Hsu, Kakade, and Zhang (hereinafter the "HKZ algorithm") in order to accommodate silent states. Familiarity with that algorithm and the notation in [Hsu …