Fold-specific substitution matrices for protein classification

MOTIVATION: Methods that focus on secondary structures, such as Position Specific Scoring Matrices and Hidden Markov Models, have proved useful for assigning proteins to families. For assigning proteins to an attribute class within a family, however, these methods may introduce more free parameters than are needed, because a family has fewer members and less variability among its sequences. We describe a method for organizing the proteins in a family that reduces the number of parameters by up to an order of magnitude. The basis is the log odds ratio commonly used to measure similarity, which we adapt to characterize the sequence dissimilarities that give rise to attribute differentiation. This leads to the definition of Class Attribute Substitution Matrices (CLASSUM), a dual of the BLOSUM.

RESULTS: The method was applied to classify sequences hierarchically in the lambda and kappa subgroups of the immunoglobulin superfamily. Positions conferring class were identified from the degree of amino acid variability at each position. The CLASSUM computed for these positions correctly classified more than 90% of the test data, compared with 35-50% for BLOSUM-62. The expected value for a random matrix is 14%. The results suggest that family-specific, data-derived substitution matrices can improve the resolution of automated methods that use generic substitution matrices to search for and classify proteins.
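The two ingredients named above, ranking alignment positions by amino acid variability and scoring residue pairs by a log odds ratio, can be illustrated with a minimal Python sketch. This is not the CLASSUM construction itself, which the abstract does not specify. Shannon entropy as the variability measure, the threshold value, the toy alignment, and the function names `column_entropy`, `variable_positions` and `log_odds` are all assumptions made for illustration. The log-odds function follows the generic BLOSUM-style form, log2 of observed pair frequency over the frequency expected from the marginals.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.
    Entropy is an assumed stand-in for 'degree of variability'."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def variable_positions(alignment, threshold):
    """Indices of columns whose entropy exceeds a (hypothetical) threshold."""
    columns = zip(*alignment)
    return [i for i, col in enumerate(columns) if column_entropy(col) > threshold]


def log_odds(pairs):
    """Generic BLOSUM-style log-odds scores from observed residue pairs.

    pairs: iterable of (a, b) residue pairs, treated as unordered.
    Returns {(a, b): log2(observed / expected)} with a <= b.
    """
    pair_counts = Counter(tuple(sorted(p)) for p in pairs)
    total = sum(pair_counts.values())

    # Marginal residue frequencies: each pair contributes two residues.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    residue_total = 2 * total

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        pa = residue_counts[a] / residue_total
        pb = residue_counts[b] / residue_total
        # Unordered mismatches can arise two ways, hence the factor of 2.
        expected = pa * pb if a == b else 2 * pa * pb
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Toy alignment (made-up data): column 0 is invariant, column 2 varies most.
alignment = ["ACD", "ACE", "ACF", "AGD"]
print(variable_positions(alignment, threshold=1.0))  # -> [2]

# Toy pair list (made-up data): matches are over-represented.
toy_pairs = [("A", "A"), ("A", "A"), ("A", "B"), ("B", "B")]
print(log_odds(toy_pairs))
```

A positive score marks a pair seen more often than chance predicts, a negative score one seen less often. Under the assumption that a class-discriminating matrix would be built from pairs drawn across attribute classes rather than within conserved blocks, the same function could be fed those pairs instead.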
