Paper information - Using substitution probabilities to improve position-specific scoring matrices

Using substitution probabilities to improve position-specific scoring matrices

Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching applications, the count vector is an imperfect representation of a position, because the observed sequences are an incomplete sample of the full set of related sequences. One general solution to this problem is to model unobserved sequences by adding artificial 'pseudo-counts' to the observed counts. We introduce a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities. In extensive empirical tests, this position-based method out-performed other pseudo-count methods and was a substantial improvement over the traditional average score method used for constructing profiles.
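The abstract does not give the formula, so the following is only a minimal sketch of how such position-based pseudo-counts might be computed. It assumes that the total pseudo-count for a column grows with the number of distinct residues observed there, and that this total is spread over residues according to substitution probabilities conditioned on the residues actually seen. The three-letter alphabet, the `TOY_PAIR_PROBS` values, the `scale` parameter, and the function names are all hypothetical and chosen for illustration; a real application would use all 20 amino acids and pair probabilities derived from a substitution matrix.

```python
from collections import Counter

# Hypothetical toy alphabet and symmetric joint pair probabilities
# (they sum to 1.0). Illustrative values only, not real amino acid data.
TOY_ALPHABET = "ILK"
TOY_PAIR_PROBS = {
    "I": {"I": 0.20, "L": 0.10, "K": 0.02},
    "L": {"I": 0.10, "L": 0.25, "K": 0.03},
    "K": {"I": 0.02, "L": 0.03, "K": 0.25},
}


def pseudo_counts(column, pair_probs, alphabet, scale=5.0):
    """Return (observed counts, pseudo-counts) for one alignment column.

    Assumption: the total pseudo-count is `scale` times the number of
    distinct residues in the column, so diverse columns get more smoothing.
    """
    counts = Counter(column)
    n_total = sum(counts.values())
    total_pseudo = scale * len(counts)

    pseudo = {}
    for a in alphabet:
        share = 0.0
        for i, n_i in counts.items():
            # Probability of seeing residue a given observed residue i.
            row_sum = sum(pair_probs[i][x] for x in alphabet)
            share += (n_i / n_total) * (pair_probs[i][a] / row_sum)
        pseudo[a] = total_pseudo * share
    return counts, pseudo


def column_probabilities(column, pair_probs, alphabet, scale=5.0):
    """Combine observed counts and pseudo-counts into residue probabilities."""
    counts, pseudo = pseudo_counts(column, pair_probs, alphabet, scale)
    denom = sum(counts.values()) + sum(pseudo.values())
    return {a: (counts.get(a, 0) + pseudo[a]) / denom for a in alphabet}


if __name__ == "__main__":
    for col in ("IIII", "IIIL"):
        probs = column_probabilities(col, TOY_PAIR_PROBS, TOY_ALPHABET)
        print(col, {a: round(p, 3) for a, p in probs.items()})
```

In this sketch a residue that never appears in the column (here `K`) still receives a nonzero probability, and the more diverse column `IIIL` is smoothed more heavily than the fully conserved column `IIII`.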

Jorja G. Henikoff | Steven Henikoff
