Assessment of a Rigorous Transitive Profile Based Search Method to Detect Remotely Similar Proteins

Abstract: Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We assess a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences to detect such relationships. In the ‘first generation’ search, PSI-BLAST, which itself involves multiple rounds of iteration, is run against a database to detect an initial set of homologues for a query protein. In the ‘second generation’ search, each homologue identified in the previous generation is used as the query for a separate PSI-BLAST run in the database, to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations, and it is continued until no new hits are detected. We assess the coverage of this ‘cascaded’ intermediate sequence search on diverse folds and find that searches of up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST, detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and that non-trivial relationships between proteins can be detected effectively. Propagating searches in this way maximizes the chances of detecting distant homologies by effectively scanning protein “fold space”.
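
The generation-by-generation propagation described in the abstract can be sketched as a breadth-first expansion over search hits. This is a minimal illustration under stated assumptions, not the authors' implementation: `cascade_search` and its `search` argument are hypothetical names, and `search` stands in for one complete PSI-BLAST run that returns the identifiers of the hits for a given query. No real PSI-BLAST call, E-value threshold, or hit filtering is modelled here.

```python
from typing import Callable, Dict, Iterable, List, Optional


def cascade_search(
    query: str,
    search: Callable[[str], Iterable[str]],
    max_generations: Optional[int] = None,
) -> Dict[str, int]:
    """Return {sequence_id: generation in which it was first detected}.

    `search` is a hypothetical stand-in for one full PSI-BLAST run.
    With max_generations=None the cascade runs until no new hits appear.
    """
    found: Dict[str, int] = {query: 0}   # the query itself is generation 0
    frontier: List[str] = [query]        # queries for the next generation
    generation = 0
    while frontier and (max_generations is None or generation < max_generations):
        generation += 1
        next_frontier: List[str] = []
        for seq_id in frontier:
            for hit in search(seq_id):
                if hit not in found:     # only new hits seed the next generation
                    found[hit] = generation
                    next_frontier.append(hit)
        frontier = next_frontier
    return found


# Toy data (invented for illustration): each sequence "detects" only its neighbours.
toy_hits = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
three_generations = cascade_search("A", lambda s: toy_hits.get(s, []), max_generations=3)
until_converged = cascade_search("A", lambda s: toy_hits.get(s, []))
```

In the toy data, a single search from `A` finds only `B`, whereas the cascade reaches `C` and `D` through intermediates. Capping the cascade at three generations mirrors the depth the abstract reports as sufficient for most known homologues, and leaving the cap unset corresponds to continuing until no new hits are detected.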
