Notes on Learning Probabilistic Automata

Alberto Apostolico

Probabilistic models of various classes of sources are developed in the context of coding and compression as well as in machine learning and classification. In the first domain, the repetitive structures of substrings are regarded as redundancies to be removed; in the second, repeated subpatterns are unveiled as carriers of information and structure. In both contexts, one rather pervasive problem is that of learning or estimating probabilities from the observed strings. For most probabilistic models, such a task poses interesting algorithmic questions (cf., e.g., the references).

A popular approach to the statistical modeling of sequences relies on the structure of uniform, fixed-memory Markov models. For sequences in important families, the autocorrelation or "memory" exhibited decays exponentially fast with length. In other words, there is a maximum length L of the recent history of a sequence above which conditioning the empirical probability distribution of the next symbol on more than the last L symbols does not change it appreciably. It is possible and customary to model these sources by Markov chains of order L, this maximum useful memory length. Even so, such automata tend in practice to be unnecessarily bulky and computationally imposing, both during their synthesis and their use.

In [6], much more compact, tree-shaped variants of probabilistic automata are built which assume an underlying Markov process of variable memory length not exceeding some maximum L. The probability distribution generated by these automata is equivalent to that of a Markov chain of order L, but the description of the automaton itself is much more succinct. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost time Θ(m²) in the worst case.

This work introduces automata equivalent to PSTs (probabilistic suffix trees) that can be learned in O(n) time, and also discusses notions of empirical probability and their efficient computation. Details of the learning procedure and of a linear-time classifier or parser may be found in [2, 3].
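To make the fixed-memory setting concrete, here is a minimal sketch (not taken from the text) of estimating an order-L Markov chain from training strings by plain frequency counts; the function name learn_fixed_order, the add-one smoothing, and the toy data are assumptions made only for illustration.

```python
# Illustrative sketch: empirical conditional probabilities for a fixed-memory
# (order-L) Markov model, estimated by counting length-L contexts.
from collections import Counter, defaultdict

def learn_fixed_order(strings, L, alphabet):
    """Estimate P(next symbol | last L symbols) from a set of training strings."""
    context_counts = defaultdict(Counter)        # context -> counts of next symbols
    for s in strings:
        for i in range(L, len(s)):
            context_counts[s[i - L:i]][s[i]] += 1
    model = {}
    for ctx, counts in context_counts.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen symbols at a small nonzero probability.
        model[ctx] = {a: (counts[a] + 1) / (total + len(alphabet)) for a in alphabet}
    return model

training = ["abracadabra", "abracabra"]
model = learn_fixed_order(training, L=2, alphabet=set("".join(training)))
print(model["ab"])   # distribution of the symbol following the context "ab"
```

A model of this kind keeps one state per observed context of length exactly L, which is the kind of bulkiness the variable-memory automata of [6] are meant to avoid.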
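In the same illustrative spirit, the sketch below mimics a variable-memory, tree-shaped model: contexts of length up to L are retained only when their next-symbol distribution departs appreciably from that of their shorter suffix, and a query is then scored with the longest retained context at each position. The pruning threshold, the smoothing, and the names learn_pst and log_likelihood are assumptions made for this example; they are not the construction of [6] nor the O(n)-time procedure of [2, 3].

```python
# Toy variable-memory (PST-style) model: keep a context only if it adds
# information over its shorter suffix, then predict with the longest kept context.
import math
from collections import Counter, defaultdict

def learn_pst(strings, L, alphabet, diverge=0.1, min_count=2):
    counts = defaultdict(Counter)                    # context -> next-symbol counts
    for s in strings:
        for i in range(len(s)):
            for k in range(0, min(L, i) + 1):        # contexts of length 0..L
                counts[s[i - k:i]][s[i]] += 1

    def dist(ctx):
        c = counts[ctx]
        total = sum(c.values())
        return {a: (c[a] + 1) / (total + len(alphabet)) for a in alphabet}

    kept = {""}                                      # the empty context is always kept
    for ctx in sorted(counts, key=len):
        if not ctx:
            continue
        parent = ctx[1:]                             # drop the oldest symbol
        if parent not in kept or sum(counts[ctx].values()) < min_count:
            continue
        d, pd = dist(ctx), dist(parent)
        # Retain the longer context only if its distribution differs appreciably.
        if max(abs(d[a] - pd[a]) for a in alphabet) >= diverge:
            kept.add(ctx)
    return {ctx: dist(ctx) for ctx in kept}

def log_likelihood(pst, s, L):
    """Score a query using, at each position, the longest context kept in the model."""
    ll = 0.0
    for i in range(len(s)):
        for k in range(min(L, i), -1, -1):
            ctx = s[i - k:i]
            if ctx in pst:
                ll += math.log(pst[ctx].get(s[i], 1e-12))
                break
    return ll

training = ["abracadabra"] * 5
pst = learn_pst(training, L=3, alphabet=set("abrcd"))
print(sorted(pst))                                   # the retained contexts
print(log_likelihood(pst, "abracad", L=3))
```

Note that this simple pruning rule only keeps a context whose shorter suffix was itself kept, which is what gives the retained contexts the shape of a tree.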