Supervised Approaches for Protein Function Prediction by Topological Data Analysis

Topological Data Analysis is a novel approach, useful whenever data can be described by topological structures such as graphs. The aim of this paper is to investigate whether such a tool can be used to define a set of descriptors useful for pattern recognition and machine learning tasks. Specifically, we consider a supervised learning problem whose final goal is to predict the physiological function of proteins starting from their respective residue contact networks. Indeed, folded proteins can effectively be described by graphs, which makes them a useful case study for assessing the effectiveness of Topological Data Analysis in pattern recognition tasks. Experiments conducted on a subset of the Escherichia coli proteome using two different classification systems show that, for the protein function prediction problem, descriptors derived from Topological Data Analysis (namely, the Betti numbers sequence) lead to classification performance comparable with that of descriptors derived from widely known centrality measures. Further benchmarking tests suggest that the Betti numbers retain some information despite the heavy compression intrinsic to casting a protein into its Betti numbers sequence.
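To make the notion of a Betti numbers descriptor concrete, the following is a minimal sketch of how such a sequence could be computed from a graph. It is not the authors' pipeline: it assumes the graph is turned into its clique complex (every set of mutually adjacent nodes is a simplex) and that homology is taken with coefficients in GF(2). The function names `cliques_by_dimension`, `rank_gf2` and `betti_numbers` are hypothetical and introduced here only for illustration.

```python
from itertools import combinations


def cliques_by_dimension(n_nodes, edges, max_dim):
    """List the simplices of the clique complex, grouped by dimension.

    A k-simplex is a sorted tuple of k+1 mutually adjacent nodes.
    """
    adj = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    simplices = [[(v,) for v in range(n_nodes)]]
    for _ in range(max_dim):
        level = []
        for s in simplices[-1]:
            # Nodes adjacent to every member of s; keep only larger
            # indices so that each clique is generated exactly once.
            common = set.intersection(*(adj[v] for v in s))
            level.extend(s + (w,) for w in sorted(common) if w > s[-1])
        simplices.append(level)
    return simplices


def rank_gf2(rows):
    """Rank over GF(2) of a binary matrix with rows stored as int bitmasks."""
    basis = {}  # highest set bit -> reduced row
    for row in rows:
        while row:
            pivot = row.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = row
                break
            row ^= basis[pivot]
    return len(basis)


def betti_numbers(n_nodes, edges, max_dim=2):
    """Betti numbers b_0..b_max_dim of the clique complex (GF(2) coefficients)."""
    simplices = cliques_by_dimension(n_nodes, edges, max_dim + 1)
    ranks = [0]  # the boundary map on vertices is zero
    for dim in range(1, max_dim + 2):
        index = {s: i for i, s in enumerate(simplices[dim - 1])}
        rows = []
        for s in simplices[dim]:
            mask = 0
            # Each face is obtained by dropping one vertex of s.
            for face in combinations(s, dim):
                mask |= 1 << index[face]
            rows.append(mask)
        ranks.append(rank_gf2(rows))
    # b_k = (number of k-simplices) - rank(d_k) - rank(d_{k+1})
    return [len(simplices[k]) - ranks[k] - ranks[k + 1]
            for k in range(max_dim + 1)]
```

For a 4-node cycle, `betti_numbers(4, [(0, 1), (1, 2), (2, 3), (0, 3)])` returns `[1, 1, 0]`: one connected component, one loop, no enclosed cavity. For a real residue contact network, the nodes would be residues and the edges contacts between them; how contacts are defined is left open here, since the text above does not specify it.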
