Predicting kinase functional sites using hierarchical stochastic language modelling

Results: Our method, the hierarchical stochastic language (HSL) model, was validated using two complementary approaches. First, the functional sites predicted by the HSL were compared with experimentally verified functional sites: the patterns in PROSITE, the contacting sites in the Protein Data Bank (PDB), and the domains in Pfam. Against the PROSITE patterns, the overall average recall/precision of the HSL model was 83.5% / 23.0%; against the PDB contacting sites, it was 66.1% / 79.9%. Against Pfam, 90% of the predicted functional sites fell within domains whose names contain the substring “kinase”. Second, 10-fold cross-validation was used to assess how accurately the HSL predicts kinase function. The HSL achieved high sensitivity (94.7%) and specificity (94.0%), compared with 94.5% and 85.8%, respectively, for MEME. The HSL model also detected kinase sub-families automatically, and the identified sub-families were consistent with known phylogenetic trees of the kinase sequences. The HSL is therefore applicable to kinase sequences that comprise heterogeneous subsets sharing the same catalytic function.
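The two kinds of figures reported above can be illustrated with a minimal sketch. It assumes that site-level recall/precision is computed over sets of residue positions and that sensitivity/specificity is computed over binary kinase/non-kinase labels. The abstract does not state the exact counting scheme, so the function names and the toy data below are hypothetical and are not the authors' implementation.

```python
def recall_precision(predicted_sites, reference_sites):
    """Recall and precision of predicted residue positions against a reference set.

    Assumption: a site is a single residue position (an integer index).
    """
    predicted, reference = set(predicted_sites), set(reference_sites)
    true_positives = len(predicted & reference)
    recall = true_positives / len(reference) if reference else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


def sensitivity_specificity(true_labels, predicted_labels):
    """Sensitivity and specificity for binary labels (1 = kinase, 0 = non-kinase)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy data (hypothetical): 8 predicted positions, 5 reference positions, 4 shared.
recall, precision = recall_precision(range(10, 18), [12, 13, 14, 15, 20])
print(f"recall={recall:.2f} precision={precision:.2f}")

# Toy data (hypothetical): 3 kinases and 4 non-kinases in one held-out fold.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1])
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Under this reading, high recall with low precision (as reported against PROSITE) means the predictions cover most annotated positions but also include many positions outside them. In a 10-fold cross-validation, the second function would be applied to each held-out fold and the results averaged.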
