An initial strategy for comparing proteins at the domain architecture level

MOTIVATION: Ideally, only proteins that exhibit highly similar domain architectures should be compared with one another as homologues or classified into a single family. We propose a method for comparing proteins based on their domain architectures that combines three indices, the Jaccard index, the Goodman-Kruskal gamma function and the domain duplicate index, into a single similarity measure.

RESULTS: Evaluation of the method using the eukaryotic orthologous groups of proteins (KOGs) database indicated that it allows automatic and efficient comparison of multiple-domain proteins, which are usually refractory to classic approaches based on sequence similarity measures. As a case study, the PDZ and LRR_1 domains are used to demonstrate how proteins containing promiscuous domains can be clearly compared using our method. For the convenience of users, a web server was set up with three query interfaces for comparing different domain architectures or proteins with domain(s), and for identifying the relationships among domain architectures within a given KOG from the Clusters of Orthologous Groups of Proteins database.

CONCLUSION: The proposed approach is suitable for estimating the similarity of protein domain architectures, especially those of multidomain proteins.

AVAILABILITY: http://cmb.bnu.edu.cn/pdart/.
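To make the idea of a combined measure concrete, the following is a minimal sketch, not the authors' implementation. The Jaccard index and the Goodman-Kruskal gamma follow their standard definitions. The abstract does not define the domain duplicate index or how the three indices are weighted, so `duplicate_index`, the use of first-occurrence positions for domain order, the rescaling of gamma to [0, 1], and the equal weights in `architecture_similarity` are all assumptions made purely for illustration. Architectures are represented as ordered lists of domain names.

```python
from collections import Counter
from itertools import combinations


def jaccard_index(arch_a, arch_b):
    """Shared distinct domains divided by all distinct domains."""
    set_a, set_b = set(arch_a), set(arch_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def goodman_kruskal_gamma(arch_a, arch_b):
    """Gamma over the relative order of shared domains.

    Assumption: each domain's position is its first occurrence.
    """
    shared = sorted(set(arch_a) & set(arch_b))
    pos_a = {d: arch_a.index(d) for d in shared}
    pos_b = {d: arch_b.index(d) for d in shared}
    concordant = discordant = 0
    for d1, d2 in combinations(shared, 2):
        product = (pos_a[d1] - pos_a[d2]) * (pos_b[d1] - pos_b[d2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        # Gamma is undefined with fewer than two shared domains;
        # returning 0.0 (neutral) is an assumption of this sketch.
        return 0.0
    return (concordant - discordant) / total


def duplicate_index(arch_a, arch_b):
    """Hypothetical stand-in for the domain duplicate index:
    agreement of copy numbers over the shared domains."""
    count_a, count_b = Counter(arch_a), Counter(arch_b)
    shared = set(count_a) & set(count_b)
    if not shared:
        return 0.0
    smaller = sum(min(count_a[d], count_b[d]) for d in shared)
    larger = sum(max(count_a[d], count_b[d]) for d in shared)
    return smaller / larger


def architecture_similarity(arch_a, arch_b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three indices; equal weights are an assumption."""
    w_jaccard, w_gamma, w_dup = weights
    # Rescale gamma from [-1, 1] to [0, 1] so all terms share one range.
    gamma01 = (goodman_kruskal_gamma(arch_a, arch_b) + 1) / 2
    return (
        w_jaccard * jaccard_index(arch_a, arch_b)
        + w_gamma * gamma01
        + w_dup * duplicate_index(arch_a, arch_b)
    )


# Illustrative architectures built from the two domains named in the abstract.
protein_a = ["LRR_1", "LRR_1", "PDZ"]
protein_b = ["PDZ", "LRR_1"]
print(architecture_similarity(protein_a, protein_b))
```

In this toy example the two proteins share the same domain set, so the Jaccard term alone would treat them as identical; the order and copy-number terms are what separate them, which is the kind of distinction a combined measure is meant to capture for proteins containing promiscuous domains.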
