Prediction suffix trees for supervised classification of sequences

This paper presents a statistical test and algorithms for pattern extraction and supervised classification of sequential data. It first defines the prediction suffix tree (PST), a tree structure that efficiently describes variable-order Markov chains. A PST performs better than a Markov chain of fixed order L, and at a lower storage cost. We propose an improvement of this model based on a statistical test, which lets us control the risk of encountering different patterns in the model of the sequence to classify and in the model of its class. Applications to biological sequences illustrate the procedure. We compare the results obtained with different models (Markov chain of order L, variable-order model, and the statistical test, with or without smoothing) and show how the choice of model parameters influences performance in these applications. These algorithms can also be used in other fields in which the data are naturally ordered.
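To make the idea of classifying with a variable-order model concrete, here is a minimal sketch in Python. It is not the procedure of the paper: the function names (`train_pst`, `log_likelihood`, `classify`) and the parameters `max_depth`, `min_count`, and `alpha` are hypothetical, a simple count threshold stands in for the statistical test, and additive smoothing stands in for whatever smoothing the paper uses. The model is stored as a flat dictionary of contexts, which is equivalent to a suffix tree here because every suffix of a retained context is also retained.

```python
import math
from collections import defaultdict


def train_pst(sequences, max_depth=3, min_count=2):
    """Count next-symbol occurrences after every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(min(max_depth, i) + 1):
                counts[seq[i - k:i]][symbol] += 1
    # Assumption: prune rare contexts with a plain count threshold
    # (a stand-in for a proper statistical test). The empty context is kept.
    return {
        ctx: dict(nxt)
        for ctx, nxt in counts.items()
        if ctx == "" or sum(nxt.values()) >= min_count
    }


def log_likelihood(model, seq, alphabet, max_depth=3, alpha=1.0):
    """Log-probability of seq, predicting each symbol from its longest known context."""
    total = 0.0
    for i, symbol in enumerate(seq):
        # Walk from the longest candidate suffix of the history down to the empty one.
        ctx = ""
        for k in range(min(max_depth, i), -1, -1):
            if seq[i - k:i] in model:
                ctx = seq[i - k:i]
                break
        nxt = model.get(ctx, {})
        n = sum(nxt.values())
        # Assumption: additive (Laplace-style) smoothing with pseudo-count alpha.
        p = (nxt.get(symbol, 0) + alpha) / (n + alpha * len(alphabet))
        total += math.log(p)
    return total


def classify(models, seq, alphabet, max_depth=3):
    """Return the class label whose model gives seq the highest log-likelihood."""
    return max(
        models,
        key=lambda label: log_likelihood(models[label], seq, alphabet, max_depth),
    )
```

In this sketch one model is trained per class, for example `models = {"A": train_pst(["ababababab"]), "B": train_pst(["aabbaabbaabb"])}`, and a new sequence is assigned with `classify(models, "ababab", "ab")`. Raising `max_depth` lets the model capture longer patterns, while raising `min_count` prunes more contexts and reduces storage.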
