UPSEC: An Algorithm for Classifying Unaligned Protein Sequences into Functional Families

To classify proteins into functional families based on their primary sequences, popular algorithms such as the k-NN-, HMM-, and SVM-based algorithms are often used. For many of these algorithms to perform their tasks, protein sequences need to be properly aligned first. Since the alignment process can be error-prone, protein classification may not be performed very accurately. To improve classification accuracy, we propose an algorithm, called the Unaligned Protein SEquence Classifier (UPSEC), which can perform its tasks without sequence alignment. UPSEC makes use of a probabilistic measure to identify residues that are useful for classification in both positive and negative training samples, and can handle multi-class classification with a single classifier and a single pass through the training data. UPSEC has been tested with real protein data sets. Experimental results show that UPSEC can effectively classify unaligned protein sequences into their corresponding functional families, and the patterns it discovers during the training process can be biologically meaningful.

[1]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[2]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[3]  M. Sternberg,et al.  A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. , 1987, Journal of molecular biology.

[4]  W. Pearson Rapid and sensitive sequence comparison with FASTP and FASTA. , 1990, Methods in enzymology.

[5]  I. Good,et al.  Information, weight of evidence, the singularity between probability measures and signal detection , 1974 .

[6]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[7]  Sean R. Eddy,et al.  Maximum Discrimination Hidden Markov Models of Sequence Consensus , 1995, J. Comput. Biol..

[8]  Jiawei Han,et al.  Cancer classification using gene expression data , 2003, Inf. Syst..

[9]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[10]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2003, ICTAI.

[11]  Jonathan Pevsner,et al.  Bioinformatics and functional genomics , 2003 .

[12]  M. A. McClure,et al.  Comparative analysis of multiple protein-sequence alignment methods. , 1994, Molecular biology and evolution.

[13]  B. K. Agarwal,et al.  X-Ray Spectroscopy: An Introduction , 1979 .

[14]  Rodrigo Lopez,et al.  Multiple sequence alignment with the Clustal series of programs , 2003, Nucleic Acids Res..

[15]  Richard Hughey,et al.  Weighting hidden Markov models for maximum discrimination , 1998, Bioinform..

[16]  Smith Rf,et al.  Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. , 1992 .

[17]  A. Sali,et al.  Protein Structure Prediction and Structural Genomics , 2001, Science.

[18]  J. Skolnick,et al.  From genes to protein structure and function: novel applications of computational approaches in the genomic era. , 2000, Trends in biotechnology.

[19]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[20]  Amos Bairoch,et al.  The ENZYME database in 2000 , 2000, Nucleic Acids Res..

[21]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[22]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[23]  S. Haberman The Analysis of Residuals in Cross-Classified Tables , 1973 .

[24]  Amos Bairoch,et al.  Recent improvements to the PROSITE database , 2004, Nucleic Acids Res..

[25]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[26]  D. Higgins,et al.  SAGA: sequence alignment by genetic algorithm. , 1996, Nucleic acids research.

[27]  Ram Samudrala,et al.  Ab initio protein structure prediction using a combined hierarchical approach , 1999, Proteins.

[28]  B. Rost Twilight zone of protein sequence alignments. , 1999, Protein engineering.

[29]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[30]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[31]  J. Thompson,et al.  The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. , 1997, Nucleic acids research.

[32]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[33]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[34]  Neil D. Rawlings,et al.  MEROPS: the peptidase database , 2009, Nucleic Acids Res..

[35]  Andreas D. Baxevanis,et al.  Bioinformatics - a practical guide to the analysis of genes and proteins , 2001, Methods of biochemical analysis.

[36]  W. Taylor A flexible method to align large numbers of biological sequences , 2005, Journal of Molecular Evolution.

[37]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[38]  George Karypis,et al.  Evaluation of Techniques for Classifying Biological Sequences , 2002, PAKDD.

[39]  Ram Samudrala,et al.  A Combined Approach for Ab Initio Construction of Low Resolution Protein Tertiary Structures from Sequence , 1999, Pacific Symposium on Biocomputing.

[40]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.

[41]  Andrew K. C. Wong,et al.  Learning sequential patterns for probabilistic inductive prediction , 1994 .

[42]  Burkhard Morgenstern,et al.  DIALIGN: finding local similarities by multiple sequence alignment , 1998, Bioinform..

[43]  Bernhard Schölkopf,et al.  Learning with kernels , 2001 .