Dissimilarity Space Representations and Automatic Feature Selection for Protein Function Prediction

Dissimilarity spaces, along with feature reduction/selection techniques, are among the mainstream approaches for pattern recognition problems in structured (and possibly non-metric) domains. In this work, we investigate dissimilarity space representations in a biology-related application, namely protein function classification, since proteins are a prime example of structured data given their primary and tertiary structures. Specifically, we propose two analyses, one relying on the complete dissimilarity matrix and one on a dimensionally reduced version of it, thereby casting the pattern recognition problem from structured domains into real-valued feature vectors, for which any standard classification algorithm can be used. A third, hybrid analysis uses a clustering-based one-class classifier exploiting different representations. First results obtained on a subset of the Escherichia coli proteome are promising, and some of the analyses presented in this work may also suit field experts, further bridging the gap between natural sciences and computational intelligence techniques.
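To make the embedding idea concrete, the following minimal Python sketch maps structured objects to real-valued vectors through their dissimilarities to a set of prototypes. It is an illustration under stated assumptions, not the method used in this work: the toy strings, the choice of edit distance as the dissimilarity measure, the hand-picked prototypes, and the function names `edit_distance` and `dissimilarity_embedding` are all hypothetical.

```python
from typing import Callable, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings.

    Assumption: used here only as an illustrative dissimilarity measure.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def dissimilarity_embedding(
    objects: Sequence[str],
    prototypes: Sequence[str],
    d: Callable[[str, str], float],
) -> List[List[float]]:
    """Represent each object by its dissimilarities to the prototypes."""
    return [[float(d(x, p)) for p in prototypes] for x in objects]


# Toy sequences (hypothetical, not real proteins).
sequences = ["MKVLA", "MKVLG", "GGSTT", "GGSTA"]

# Complete dissimilarity matrix: every object acts as a prototype.
full = dissimilarity_embedding(sequences, sequences, edit_distance)

# Reduced representation: only a subset of prototypes is kept.
# Assumption: the subset is hand-picked here; in practice it would come
# from a feature selection or reduction procedure.
reduced = dissimilarity_embedding(
    sequences, [sequences[0], sequences[2]], edit_distance
)
```

Each row of `full` or `reduced` is an ordinary feature vector, so it can be passed to any standard classifier. The reduced variant trades some information for fewer dissimilarity evaluations per object.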
