An Effective Data Mining Technique for Classifying Unaligned Protein Sequences into Functional Families

Existing classification algorithms, such as k-NN, HMM and SVM-based algorithms, are often used to classify proteins into functional families based on their primary sequences. Most of these algorithms require the protein sequences to be properly aligned first. Since the alignment process is error-prone, classification accuracy may suffer. In addition to requiring accurate alignment, many existing approaches need additional techniques to decompose a multi-class protein classification problem into a number of binary problems, which may slow the learning process when the number of classes is large. For these reasons, we propose an effective data mining technique in this paper. The technique has been applied to real protein sequence classification tasks. Experimental results show that it can effectively classify unaligned protein sequences into their corresponding functional families, and that the patterns it discovers during training are biologically meaningful. These patterns can lead to a better understanding of protein functions and allow functionally significant structural features of different protein families to be better characterized.
