Embedding strategies for effective use of information from multiple sequence alignments

We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position‐specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM‐embedded queries produced the best results overall when used with a special version of the Smith‐Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.
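The consensus-embedding idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0-based block coordinates, the toy sequences, and the use of a simple majority vote per column to pick each consensus residue are all assumptions made for illustration. The abstract does not say how consensus residues are chosen.

```python
from collections import Counter


def column_consensus(segments):
    """Return the most frequent residue in each column of a gap-free block.

    Assumption: plain majority vote, ties broken alphabetically.
    """
    width = len(segments[0])
    if any(len(s) != width for s in segments):
        raise ValueError("all segments in a block must have the same length")
    consensus = []
    for column in zip(*segments):
        counts = Counter(column)
        # Highest count first; alphabetical order makes ties deterministic.
        best = min(counts, key=lambda residue: (-counts[residue], residue))
        consensus.append(best)
    return "".join(consensus)


def embed_consensus(representative, blocks):
    """Overwrite conserved regions of a representative sequence.

    blocks is a list of (start, segments) pairs, where start is the
    hypothetical 0-based offset of the block in the representative and
    segments are the aligned family members for that region. Residues
    outside the blocks are left exactly as they were.
    """
    residues = list(representative)
    for start, segments in blocks:
        consensus = column_consensus(segments)
        end = start + len(consensus)
        if start < 0 or end > len(residues):
            raise ValueError("block falls outside the representative sequence")
        residues[start:end] = consensus
    return "".join(residues)


# Toy data, invented for this example.
representative = "MKTAYIAKQRQISFVKSHFSRQ"
blocks = [(3, ["AYIAK", "AFIAK", "SFIAR"])]
embedded = embed_consensus(representative, blocks)
print(embedded)
```

The resulting string is still an ordinary single sequence, which is why a query built this way can be passed unchanged to programs such as BLAST or FASTA. Embedding PSSMs instead would require a search program that accepts position-specific scores, as the abstract notes.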
