Alberto Apostolico

Probabilistic models of various classes of sources are developed in the context of coding and compression as well as in machine learning and classification. In the first domain, the repetitive structures of substrings are regarded as redundancies to be removed; in the second, repeated subpatterns are unveiled as carriers of information and structure. In both contexts, one rather pervasive problem is that of learning or estimating probabilities from the observed strings. For most probabilistic models, this task poses interesting algorithmic questions (cf., e.g., the references).

A popular approach to the statistical modeling of sequences relies on uniform, fixed-memory Markov models. For sequences in important families, the autocorrelation or "memory" exhibited decays exponentially fast with length. In other words, there is a maximum length L of the recent history of a sequence above which the empirical probability distribution of the next symbol, given the last ℓ > L symbols, does not change appreciably. It is possible and customary to model such sources by Markov chains of order L, this maximum useful memory length. Even so, such automata tend in practice to be unnecessarily bulky and computationally imposing, both during their synthesis and during their use.

In [6], much more compact, tree-shaped variants of probabilistic automata are built under the assumption of an underlying Markov process of variable memory length not exceeding some maximum L. The probability distribution generated by these automata is equivalent to that of a Markov chain of order L, but the description of the automaton itself is much more succinct. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost Θ(m²) time in the worst case. This work introduces automata equivalent to such probabilistic suffix trees (PSTs) that can be learned in O(n) time, and also discusses notions of empirical probability and their efficient computation. Details of the learning procedure and of a linear-time classifier or parser may be found in [2, 3].
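The fixed-memory setting above amounts to estimating, from a training string, the empirical distribution of the next symbol conditioned on the preceding L symbols. The following Python sketch (purely illustrative, with the hypothetical name order_L_model; it is not the construction studied in the paper) tabulates those relative frequencies.

    from collections import Counter, defaultdict

    def order_L_model(text, L):
        """Empirical next-symbol distribution given the last L symbols,
        estimated by relative frequency over a single training string."""
        counts = defaultdict(Counter)
        for i in range(L, len(text)):
            context = text[i - L:i]      # the L symbols preceding position i
            counts[context][text[i]] += 1
        model = {}
        for ctx, ctr in counts.items():
            total = sum(ctr.values())
            model[ctx] = {sym: c / total for sym, c in ctr.items()}
        return model

    # Toy usage on a short DNA-like string with memory length L = 2.
    model = order_L_model("ACGACGTACGACG", 2)
    print(model["CG"])   # distribution of the symbol following the context "CG"

Even this toy example hints at the bulk of the fixed-order approach: the number of distinct contexts can grow exponentially in L, which is what motivates the variable-memory models discussed next.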
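By contrast, the variable-memory automata of [6] condition each symbol on the longest relevant suffix of the history rather than on a fixed window of length L. The sketch below is again only an illustration under assumed inputs, not the PST algorithm itself: it scores a query against a table mapping candidate contexts to next-symbol distributions, and the per-symbol search over suffixes suggests why naive prediction can cost more than linear time, a cost the automata introduced in this work are designed to avoid.

    import math

    def log_likelihood(query, contexts):
        """Log-likelihood of `query` under a variable-memory model.

        `contexts` maps conditioning strings (length 0 up to some maximum L)
        to next-symbol probability dicts; the empty context "" must be present.
        Each symbol is scored against the longest suffix of the preceding
        symbols that appears as a context, mimicking a suffix-tree lookup.
        """
        max_len = max(len(c) for c in contexts)
        logp = 0.0
        for i, sym in enumerate(query):
            history = query[:i]
            ctx = ""
            for k in range(min(len(history), max_len), 0, -1):
                if history[-k:] in contexts:
                    ctx = history[-k:]
                    break
            # A small floor keeps symbols unseen in a context from zeroing the score.
            logp += math.log(contexts[ctx].get(sym, 1e-6))
        return logp

    # Toy usage: contexts of length 0, 1 and 2 over the alphabet {A, C, G, T}.
    contexts = {
        "":   {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "C":  {"G": 0.9, "A": 0.1},
        "CG": {"A": 0.7, "T": 0.3},
    }
    print(log_likelihood("ACGA", contexts))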
[1] Vineet Bafna, et al. Pattern Matching Algorithms, 1997.
[2] Golan Yona, et al. Modeling protein families using probabilistic suffix trees, RECOMB, 1999.
[3] Alfred V. Aho, et al. The Design and Analysis of Computer Algorithms, 1974.
[4] David Haussler, et al. The Smallest Automaton Recognizing the Subwords of a Text, Theor. Comput. Sci., 1985.
[5] Alberto Apostolico, et al. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space, RECOMB '00, 2000.
[6] Jorma Rissanen, et al. A universal data compression system, IEEE Trans. Inf. Theory, 1983.
[7] Wojciech Rytter, et al. Text Algorithms, 1994.
[8] Peter Weiner, et al. Linear Pattern Matching Algorithms, SWAT, 1973.
[9] Edward M. McCreight, et al. A Space-Economical Suffix Tree Construction Algorithm, JACM, 1976.