Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
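The idea of turning a set of aligned sequences into a position-specific score matrix and then searching with that matrix can be illustrated with a small sketch. It is not the PSI-BLAST implementation: it assumes an ungapped alignment of equal-length sequences, a uniform background frequency, and a flat pseudocount, and it omits sequence weighting, gapped extension, and the statistical significance estimates. The names `build_pssm` and `best_ungapped_hit` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (illustrative only): equal-length ungapped sequences,
    uniform background frequencies, and a flat pseudocount per residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)  # log-odds in bits
        pssm.append(column)
    return pssm


def best_ungapped_hit(pssm, subject):
    """Slide the matrix along `subject`; return (score, offset) of the best window."""
    width = len(pssm)
    best_score, best_offset = float("-inf"), -1
    for start in range(len(subject) - width + 1):
        # Residues outside the 20-letter alphabet contribute a neutral score of 0.
        score = sum(pssm[i].get(subject[start + i], 0.0) for i in range(width))
        if score > best_score:
            best_score, best_offset = score, start
    return best_score, best_offset


if __name__ == "__main__":
    matrix = build_pssm(["ACDW", "ACDW", "ACEW"])
    print(best_ungapped_hit(matrix, "GGACDWGG"))
```

In an iterated search, the sequences found with significant scores in one pass would be aligned and fed back into the matrix-building step for the next pass; the sketch shows only a single build-and-score round.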
