Automatic Identification and Classification of Protein Domains

Motivation: Proteins are composed of one or several domains. Such domains can be classified into families according to their biological function. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational tools for large-scale determination of protein domains and their boundaries. The present paper addresses the challenge of developing computational tools to identify protein domains and to classify them into their families. The eventual goal of our research is to identify and correctly classify all protein domains automatically.

Results: Our method, called EVEREST, combines methodologies from the fields of finite metric spaces, machine learning and statistical modeling, and achieves state-of-the-art results. The process begins by constructing a database of protein segments that emerge from an all-vs-all pairwise sequence comparison. It then clusters these segments, chooses the best clusters using machine learning techniques, and creates a statistical model for each of these clusters. This procedure is then iterated: the statistical models are used to scan all protein sequences, to recreate the segment database, and to cluster the segments again. Performance tests show that EVEREST recovers 63% of Pfam families and 40% of SCOP families with high accuracy, and suggests new families with about 40% fidelity. EVEREST domains are often combinations of domains as defined by Pfam or SCOP, and often subdomains of such domains. The paper concludes with a discussion of research avenues for improving these results.

Availability: A database of statistical models (HMMER HMMs), one per domain family, is available for download at http://www.cs.huji.ac.il/ elonp/everest.
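
The iterative procedure described under Results (build a segment database, cluster, select clusters, model each cluster, rescan, repeat) can be pictured with a toy sketch. This is not the EVEREST implementation: every function name and threshold below is hypothetical, and each step uses a deliberately simple stand-in. The longest common substring replaces local sequence alignment, a size cutoff replaces the machine-learned cluster selection, and a consensus string replaces the HMM.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_segments(seqs, min_len=4):
    """Stand-in for all-vs-all comparison: longest common substring per pair."""
    segments = []
    for a, b in combinations(seqs, 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size >= min_len:
            # Record the matching segment once for each of the two sequences.
            segments.append(a[m.a:m.a + m.size])
            segments.append(b[m.b:m.b + m.size])
    return segments


def cluster(segments, threshold=0.8):
    """Greedy clustering: join the first cluster whose representative is similar."""
    clusters = []
    for seg in segments:
        for members in clusters:
            rep = members[0]
            if SequenceMatcher(None, rep, seg, autojunk=False).ratio() >= threshold:
                members.append(seg)
                break
        else:
            clusters.append([seg])
    return clusters


def select_clusters(clusters, min_size=2):
    """Stand-in for learned cluster selection: keep clusters above a size cutoff."""
    return [c for c in clusters if len(c) >= min_size]


def build_model(members):
    """Stand-in for a statistical model: the most frequent member string."""
    return Counter(members).most_common(1)[0][0]


def scan(models, seqs):
    """Rebuild the segment database: one segment per exact model hit."""
    return [m for m in models for s in seqs if m in s]


def run(seqs, iterations=2):
    """Iterate cluster -> select -> model -> rescan for a fixed number of rounds."""
    segments = pairwise_segments(seqs)
    models = []
    for _ in range(iterations):
        models = [build_model(c) for c in select_clusters(cluster(segments))]
        segments = scan(models, seqs)
    return models
```

Calling `run` on a handful of short strings that share a common substring returns one model per recovered "family". The point is only the shape of the loop, in which the models from one round regenerate the segment database for the next.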
