Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally demanding, both to build and to use. Recently, D. Ron, Y. Singer, and N. Tishby built much more compact, tree-shaped variants of probabilistic automata under the assumption of an underlying Markov process of variable memory length. These variants, called Probabilistic Suffix Trees (PSTs), were subsequently adapted by G. Bejerano and G. Yona and applied successfully to learning and prediction of protein families. Learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered as a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost Θ(m²) time in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties: (i) learning the automaton, for any L, takes O(n) time; (ii) prediction of a string of m symbols by the automaton takes O(m) time. Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and their efficient computation, a by-product of possibly more general interest.
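To make the notions of empirical probability and variable-memory prediction concrete, here is a minimal Python sketch of the naive approach. It is not the linear-time automaton introduced in the paper, nor the exact PST construction of Ron, Singer, and Tishby. The function names `learn_counts` and `log_likelihood`, the `min_count` threshold for accepting a context, and the additive `pseudo` smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def learn_counts(sequences, max_len):
    """Count how often each symbol follows every context of length 0..max_len.

    Naive illustration only: every substring is hashed explicitly, so the
    cost grows with max_len, unlike the O(n) construction in the paper.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(0, min(max_len, i) + 1):
                context = seq[i - k:i]  # the k symbols preceding position i
                counts[context][symbol] += 1
    return counts


def log_likelihood(query, counts, alphabet, max_len, min_count=2, pseudo=1.0):
    """Score a query with a variable-length memory (hypothetical smoothing).

    For each position, the longest preceding context (up to max_len) seen at
    least min_count times is used; the empty context is the fallback.
    """
    total = 0.0
    for i, symbol in enumerate(query):
        context = ""
        for k in range(min(max_len, i), 0, -1):
            candidate = query[i - k:i]
            if sum(counts.get(candidate, {}).values()) >= min_count:
                context = candidate
                break
        followers = counts.get(context, {})
        seen = sum(followers.values())
        # Smoothed empirical probability of `symbol` after `context`.
        p = (followers.get(symbol, 0) + pseudo) / (seen + pseudo * len(alphabet))
        total += math.log(p)
    return total


training = ["abab", "abab"]
model = learn_counts(training, max_len=2)
print(log_likelihood("abab", model, "ab", max_len=2))
print(log_likelihood("aabb", model, "ab", max_len=2))
```

In this sketch each query symbol triggers a fresh search over candidate contexts, so the work per symbol grows with `max_len`. Avoiding that repeated search is the kind of saving the O(m) prediction bound refers to.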