Automated protein classification using consensus decision

We propose a novel technique for automatically generating the SCOP classification of a protein structure with high accuracy. High accuracy is achieved by combining the decisions of multiple methods using the consensus of a committee (or an ensemble) classifier. Our technique is rooted in machine learning which shows that by judicially employing component classifiers, an ensemble classifier can be constructed to outperform its components. We use two sequence- and three structure-comparison tools as component classifiers. Given a protein structure, using the joint hypothesis, we first determine if the protein belongs to an existing category (family, superfamily, fold) in the SCOP hierarchy. For the proteins that are predicted as members of the existing categories, we compute their family-, superfamily-, and fold-level classifications using the consensus classifier. We show that we can significantly improve the classification accuracy compared to the individual component classifiers. In particular, we achieve error rates that are 3-12 times less than the individual classifiers' error rates at the family level, 1.5-4.5 times less at the superfamily level, and 1.1-2.4 times less at the fold level.

[1]  J. Gough The SUPERFAMILY database in structural genomics. , 2002, Acta crystallographica. Section D, Biological crystallography.

[2]  David G. Stork,et al.  Pattern Classification , 1973 .

[3]  P E Bourne,et al.  An alternative view of protein fold space , 2000, Proteins.

[4]  Eytan Domany,et al.  Automated assignment of SCOP and CATH protein structure classifications from FSSP scores , 2002, Proteins.

[5]  C. Sander,et al.  Protein structure comparison by alignment of distance matrices. , 1993, Journal of molecular biology.

[6]  S. Bryant,et al.  Threading a database of protein cores , 1995, Proteins.

[7]  J Lundström,et al.  Pcons: A neural‐network–based consensus predictor that improves fold recognition , 2001, Protein science : a publication of the Protein Society.

[8]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[9]  E. Lindahl,et al.  Identification of related proteins on family, superfamily and fold level. , 2000, Journal of molecular biology.

[10]  David Chiu,et al.  BOOK REVIEW: "PATTERN CLASSIFICATION", R. O. DUDA, P. E. HART and D. G. STORK, Second Edition , 2001 .

[11]  P E Bourne,et al.  Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. , 1998, Protein engineering.

[12]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[13]  S F Altschul,et al.  Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. , 1998, Trends in biochemical sciences.

[14]  David C. Jones,et al.  CATH--a hierarchic classification of protein domain structures. , 1997, Structure.

[15]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[16]  M. Linial,et al.  Estimating the probability for a protein to have a new fold: A statistical computational model. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Gunnar Rätsch,et al.  Advanced Lectures on Machine Learning , 2004, Lecture Notes in Computer Science.

[18]  C Sander,et al.  Mapping the Protein Universe , 1996, Science.