Biosequence Classification using Sequential Pattern Mining and Optimization

In this paper we present a methodology for biosequence classification that combines sequential pattern mining with an optimization algorithm. In the first stage, a sequential pattern mining algorithm is applied to a set of biological sequences to extract sequential patterns. The score of each pattern with respect to each sequence is then calculated using a scoring function, and from these scores a score is estimated for each class under consideration. The pattern and class scores are then updated by multiplying them by weights. In the second stage, an optimization technique is employed to calculate the weight values that achieve the optimal classification accuracy. The methodology is applied to the protein class and fold prediction problems and is evaluated extensively on a dataset obtained from the Protein Data Bank.
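The two stages described above can be illustrated with a minimal Python sketch. It is not the method of the paper: the abstract does not specify the scoring function, the weighting scheme, or the optimizer, so everything concrete here is an assumption made for illustration. In particular, the patterns are assumed to be already mined and grouped by class, the pattern score is a simple binary subsequence match, one weight is used per class, and an exhaustive grid search stands in for the optimization technique. All names (`is_subsequence`, `class_scores`, `grid_search_weights`, the toy sequences and labels) are hypothetical.

```python
from itertools import product


def is_subsequence(pattern, sequence):
    """Return True if the symbols of pattern occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(symbol in remaining for symbol in pattern)


def class_scores(sequence, patterns_by_class, weights):
    """Score each class as: class weight * fraction of that class's patterns matched.

    Assumption: a binary match score per pattern and one weight per class.
    """
    scores = {}
    for label, patterns in patterns_by_class.items():
        matched = sum(is_subsequence(p, sequence) for p in patterns)
        scores[label] = weights[label] * matched / len(patterns)
    return scores


def predict(sequence, patterns_by_class, weights):
    """Assign the sequence to the class with the highest weighted score."""
    scores = class_scores(sequence, patterns_by_class, weights)
    return max(scores, key=scores.get)


def accuracy(data, patterns_by_class, weights):
    """Fraction of (sequence, label) pairs that are classified correctly."""
    hits = sum(predict(seq, patterns_by_class, weights) == label for seq, label in data)
    return hits / len(data)


def grid_search_weights(data, patterns_by_class, grid=(0.5, 1.0, 2.0)):
    """Second stage (stand-in): try every weight combination and keep the most accurate."""
    labels = sorted(patterns_by_class)
    best_weights, best_acc = None, -1.0
    for combo in product(grid, repeat=len(labels)):
        weights = dict(zip(labels, combo))
        acc = accuracy(data, patterns_by_class, weights)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Toy data (hypothetical): patterns assumed to come from the first, mining stage.
patterns_by_class = {"alpha": ["AEL", "LKA"], "beta": ["VTV", "TVY"]}
training_data = [("AELLKA", "alpha"), ("VTVYTV", "beta")]

best_weights, best_acc = grid_search_weights(training_data, patterns_by_class)
print(best_weights, best_acc)
```

A grid search is used only because it is short and easy to follow; with many classes or per-pattern weights the number of combinations grows exponentially, and a numerical optimizer would be the more practical choice.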
