Protein function prediction via graph kernels

MOTIVATION: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs.

RESULTS: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information in our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively.

AVAILABILITY: More information is available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html.
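To make the idea of comparing protein graphs with a graph kernel concrete, the following is a minimal sketch of a generic walk-counting kernel on node-labelled graphs. It is not the kernel used in the paper, whose details the abstract does not give. The graph encoding (a list of node labels plus a set of undirected edges), the labels "H" and "E", the function names `adjacency`, `walk_kernel` and `gram_matrix`, and the parameters `max_len` and `decay` are all assumptions made for illustration.

```python
from itertools import product


def adjacency(n_nodes, edges):
    """Build neighbour sets for an undirected graph given as index pairs."""
    adj = [set() for _ in range(n_nodes)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def walk_kernel(g1, g2, max_len=2, decay=0.5):
    """Weighted count of label-matching walks shared by two graphs.

    Each graph is a pair (labels, edges). Walks of length k contribute
    with weight decay**k; max_len bounds the walk length (both hypothetical
    parameters chosen for this sketch).
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    adj1 = adjacency(len(labels1), edges1)
    adj2 = adjacency(len(labels2), edges2)

    # Nodes of the direct product graph: node pairs with identical labels.
    nodes = {
        (a, b)
        for a, b in product(range(len(labels1)), range(len(labels2)))
        if labels1[a] == labels2[b]
    }

    # counts[v] = number of common walks of the current length ending at v.
    counts = {v: 1.0 for v in nodes}
    total = sum(counts.values())  # walks of length 0
    weight = 1.0
    for _ in range(max_len):
        weight *= decay
        nxt = {v: 0.0 for v in nodes}
        for (a, b), c in counts.items():
            for a2 in adj1[a]:
                for b2 in adj2[b]:
                    if (a2, b2) in nodes:
                        nxt[(a2, b2)] += c
        counts = nxt
        total += weight * sum(counts.values())
    return total


def gram_matrix(graphs, **kwargs):
    """Pairwise kernel values for a list of graphs."""
    return [[walk_kernel(a, b, **kwargs) for b in graphs] for a in graphs]


# Two toy graphs; labels are placeholders, not real protein annotations.
g1 = (["H", "E", "H"], {(0, 1), (1, 2)})
g2 = (["H", "E"], {(0, 1)})
K = gram_matrix([g1, g2])
```

A Gram matrix such as `K` could then be passed to any support vector machine implementation that accepts precomputed kernels, with one row per protein graph and enzyme or non-enzyme class labels as targets.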
