Mining Sequence Classifiers for Early Prediction

Supervised learning on sequence data, also known as sequence classification, is well recognized as an important data mining task with many significant applications. Because temporal order matters in sequence data, early prediction is a highly desirable feature of sequence classifiers in many critical applications, such as medical diagnosis and disaster prediction. In early prediction, a sequence classifier should use as short a prefix of a sequence as possible to make a reasonably accurate prediction. To the best of our knowledge, early prediction on sequence data has not been studied systematically. In this paper, we identify the novel problem of mining sequence classifiers for early prediction, and we analyze the problem and its challenges. As a first attempt to tackle the problem, we propose two methods. The sequential classification rule (SCR) method mines a set of sequential classification rules as a classifier; a so-called early-prediction utility is defined and used to select features and rules. The generalized sequential decision tree (GSDT) method adopts a divide-and-conquer strategy to generate a classification model. We conduct an extensive empirical evaluation on several real data sets. Our two methods achieve accuracy comparable to that of the state-of-the-art methods, but typically need only very short prefixes of the sequences. The results indicate that early prediction is feasible and effective.
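
To make the notion of early prediction concrete, the following is a minimal sketch of how a rule-based classifier might read a sequence one element at a time and stop at the first prefix matched by a rule. It is an illustration under stated assumptions, not the SCR method itself: the rule representation (a pattern paired with a class label), the subsequence-matching criterion, and the names `is_subsequence` and `predict_early` are all hypothetical, and the early-prediction utility used to select rules is not modeled here.

```python
from typing import Hashable, Optional, Sequence, Tuple

# Hypothetical rule format: (pattern, class label). Not taken from the paper.
Rule = Tuple[Tuple[Hashable, ...], str]


def is_subsequence(pattern: Sequence[Hashable], prefix: Sequence[Hashable]) -> bool:
    """Return True if pattern occurs in prefix in order (gaps allowed)."""
    it = iter(prefix)
    # `symbol in it` advances the iterator, so order is respected.
    return all(symbol in it for symbol in pattern)


def predict_early(
    sequence: Sequence[Hashable],
    rules: Sequence[Rule],
    default: Optional[str] = None,
) -> Tuple[Optional[str], int]:
    """Scan growing prefixes and predict as soon as any rule matches.

    Returns (label, prefix_length_used). If no rule ever matches, the
    default label is returned together with the full sequence length.
    """
    for length in range(1, len(sequence) + 1):
        prefix = sequence[:length]
        for pattern, label in rules:  # assumed: earlier rules take priority
            if is_subsequence(pattern, prefix):
                return label, length
    return default, len(sequence)


# Toy rules, invented for illustration only.
toy_rules = [(("a", "b"), "pos"), (("c",), "neg")]
result = predict_early("xaybzz", toy_rules)
```

With these toy rules, the sequence `"xaybzz"` is classified after reading only its first four elements, since the pattern `("a", "b")` first matches the prefix `"xayb"`. The length returned alongside the label is the quantity an early-prediction method would try to keep small while preserving accuracy.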
