Learning Complex and Sparse Events in Long Sequences

The Hierarchical Hidden Markov Model (HHMM) is a well formalized tool suitable to model complex patterns in long temporal or spatial sequences. Even if effective algorithms are available to estimate HHMM parameters from sequences, little has been done in order to automatize the construction of the model architecture. The primary focus of this paper is on a multi-strategy algorithm for inferring the HHMM structure from a set of sequences, where the events to capture are present in a relevant portion of them. The algorithm follows a bottom-up strategy, in which elementary facts in the sequences are progressively grouped, thus building the abstraction hierarchy of a HHMM, layer after layer. In this process, clustering algorithms and sequence alignment algorithms, widely used in domains like molecular biology, are exploited. The induction strategy has been designed in order to deal with events characterized by a sparse structure, where gaps filled by irrelevant facts can be intermixed with the relevant ones. Irrelevant facts are modeled by "gaps ", i.e., HMMs of the noise. Gaps are hypothesized when there is no significant statistical evidence for hypothesizing the existence of a specific episode. Moreover, gaps can be replaced in a second time by a episode model, after new facts have been acquired. The method is extensively evaluated on artificial datasets.