Splitting the BLOSUM Score into Numbers of Biological Significance

Mathematical tools developed in the context of Shannon information theory were used to analyze the meaning of the BLOSUM score, which was split into three components termed the BLOSUM spectrum (or BLOSpectrum). These relate, respectively, to the sequence convergence (the stochastic similarity of the two protein sequences), to the background frequency divergence (the typicality of the amino acid probability distribution in each sequence), and to the target frequency divergence (the compliance of the amino acid variations between the two sequences with the protein model implicit in the BLOCKS database). This treatment sharpens protein sequence comparison, provides a rationale for the biological significance of the obtained score, and helps to identify weakly related sequences. Moreover, the BLOSpectrum can guide the choice of the most appropriate scoring matrix, tailoring it to the evolutionary divergence associated with the two sequences, or indicate whether a compositionally adjusted matrix could perform better.
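
The three-way split can be illustrated with a small numerical sketch. It is an assumption, not a formula quoted from the text above: for an ungapped alignment with observed pair frequencies f(a, b), the per-position log-odds score is taken to equal the mutual information of the two sequences, plus the Kullback-Leibler divergences of each sequence's composition from the background P, minus the divergence of f from the target frequencies T. The function name `blospectrum`, the two-letter alphabet, and the toy values of T and P are all hypothetical.

```python
from collections import Counter
from math import isclose, log2


def blospectrum(seq_x, seq_y, target, background):
    """Split the per-position log-odds score of an ungapped alignment.

    target[(a, b)] is an assumed joint target frequency T(a, b) and
    background[a] an assumed background frequency P(a); both are toy
    inputs here, not values taken from the BLOCKS database.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to equal length")
    n = len(seq_x)
    # Observed joint and marginal frequencies of the aligned residues.
    joint = {ab: c / n for ab, c in Counter(zip(seq_x, seq_y)).items()}
    fx = {a: c / n for a, c in Counter(seq_x).items()}
    fy = {b: c / n for b, c in Counter(seq_y).items()}

    # Normalized score: average unrounded log-odds per aligned position.
    score = sum(f * log2(target[(a, b)] / (background[a] * background[b]))
                for (a, b), f in joint.items())
    # Sequence convergence: mutual information of the aligned residues.
    convergence = sum(f * log2(f / (fx[a] * fy[b]))
                      for (a, b), f in joint.items())
    # Background frequency divergence: D(fx || P) + D(fy || P).
    background_div = (sum(f * log2(f / background[a]) for a, f in fx.items())
                      + sum(f * log2(f / background[b]) for b, f in fy.items()))
    # Target frequency divergence: D(f || T).
    target_div = sum(f * log2(f / target[ab]) for ab, f in joint.items())
    return {"score": score, "convergence": convergence,
            "background_div": background_div, "target_div": target_div}


# Hypothetical two-letter alphabet with a symmetric toy target distribution.
T = {("H", "H"): 0.4, ("H", "P"): 0.1, ("P", "H"): 0.1, ("P", "P"): 0.4}
P = {"H": 0.5, "P": 0.5}
parts = blospectrum("HHPPHPHH", "HHPPPHHH", T, P)
assert isclose(parts["score"],
               parts["convergence"] + parts["background_div"] - parts["target_div"])
```

In this sketch the identity holds for unrounded log-odds values. Published BLOSUM matrices are scaled and rounded to integers, so a score computed from a tabulated matrix would match the sum of the three terms only approximately.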
