Issues in searching molecular sequence databases

Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention has been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important for realizing the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases, and network access to similarity search services.
