Evaluating the impact of topological protein features on the negative examples selection

BackgroundSupervised machine learning methods when applied to the problem of automated protein-function prediction (AFP) require the availability of both positive examples (i.e., proteins which are known to possess a given protein function) and negative examples (corresponding to proteins not associated with that function). Unfortunately, publicly available proteome and genome data sources such as the Gene Ontology rarely store the functions not possessed by a protein. Thus the negative selection, consisting in identifying informative negative examples, is currently a central and challenging problem in AFP. Several heuristics have been proposed through the years to solve this problem; nevertheless, despite their effectiveness, to the best of our knowledge no previous existing work studied which protein features are more relevant to this task, that is, which protein features help more in discriminating reliable and unreliable negatives.ResultsThe present work analyses the impact of several features on the selection of negative proteins for the Gene Ontology (GO) terms. The analysis is network-based: it exploits the fact that proteins can be naturally structured in a network, considering the pairwise relationships coming from several sources of data, such as protein-protein and genetic interactions. Overall, the proposed protein features, including local and global graph centrality measures and protein multifunctionality, can be term-aware (i.e., depending on the GO term) and term-unaware (i.e., invariant across the GO terms). We validated the informativeness of each feature utilizing a temporal holdout in three different experiments on yeast, mouse and human proteomes: (i) feature selection to detect which protein features are more helpful for the negative selection; (ii) protein function prediction to verify whether the features considered are also useful to predict GO terms; (iii) negative selection by applying two different negative selection algorithms on proteins represented through the proposed features.ConclusionsTerm-aware features (with some exceptions) resulted more informative for problem (i), together with node betweenness, which is the most relevant among term-unaware features. The node positive neighborhood instead is the most predictive feature for the AFP problem, while experiment (iii) showed that the proposed features allow negative selection algorithms to select effectively negative instances in the temporal holdout setting, with better results when nonlinear combinations of features are also exploited.

[1]  Klamer Schutte,et al.  Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos , 2013, Machine Vision and Applications.

[2]  D. Eisenberg,et al.  A combined algorithm for genome-wide prediction of protein function , 1999, Nature.

[3]  John Skvoretz,et al.  Node centrality in weighted networks: Generalizing degree and shortest paths , 2010, Soc. Networks.

[4]  Dario Malchiodi,et al.  Selection of Negative Examples for Node Label Prediction Through Fuzzy Clustering Techniques , 2015, Advances in Neural Networks.

[5]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[6]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[7]  Sebastiano Vigna,et al.  Axioms for Centrality , 2013, Internet Math..

[8]  Davide Heller,et al.  STRING v10: protein–protein interaction networks, integrated over the tree of life , 2014, Nucleic Acids Res..

[9]  Katharina Morik,et al.  Combining Statistical Learning with a Knowledge-Based Approach - A Case Study in Intensive Care Monitoring , 1999, ICML.

[10]  Alessandro Vespignani,et al.  Global protein function prediction from protein-protein interaction networks , 2003, Nature Biotechnology.

[11]  Duncan J. Watts,et al.  Collective dynamics of ‘small-world’ networks , 1998, Nature.

[12]  Jagdish Chandra Patra,et al.  Integration of multiple data sources to prioritize candidate genes using discounted rating system , 2010, BMC Bioinformatics.

[13]  H. Mewes,et al.  The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. , 2004, Nucleic acids research.

[14]  Quaid Morris,et al.  Using the Gene Ontology Hierarchy when Predicting Gene Function , 2009, UAI.

[15]  Tapio Salakoski,et al.  An expanded evaluation of protein function prediction methods shows an improvement in accuracy , 2016, Genome Biology.

[16]  L. Freeman Centrality in social networks conceptual clarification , 1978 .

[17]  Josef Kittler,et al.  Floating search methods in feature selection , 1994, Pattern Recognit. Lett..

[18]  Q. Morris,et al.  Labeling Nodes Using Three Degrees of Propagation , 2012, PloS one.

[19]  Giorgio Valentini,et al.  UNIPred: Unbalance-Aware Network Integration and Prediction of Protein Functions , 2015, J. Comput. Biol..

[20]  B. Schwikowski,et al.  A network of protein–protein interactions in yeast , 2000, Nature Biotechnology.

[21]  Jeroen de Ridder,et al.  Scale-space measures for graph topology link protein network architecture to function , 2014, Bioinform..

[22]  Dennis Shasha,et al.  Negative Example Selection for Protein Function Prediction: The NoGO Database , 2014, PLoS Comput. Biol..

[23]  N. Lin Foundations of social research , 1976 .

[24]  Quaid Morris,et al.  Fast integration of heterogeneous data sources for predicting gene function with limited annotation , 2010, Bioinform..

[25]  J. Anthonisse The rush in a directed graph , 1971 .

[26]  Alex Bavelas,et al.  Communication Patterns in Task‐Oriented Groups , 1950 .

[27]  Ambuj K. Singh,et al.  Molecular Function Prediction Using Neighborhood Features , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[28]  A. Vespignani,et al.  The architecture of complex weighted networks. , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[29]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[30]  Dario Malchiodi,et al.  Analysis of Informative Features for Negative Selection in Protein Function Prediction , 2017, IWBBIO.

[31]  Jesse Gillis,et al.  The Impact of Multifunctional Genes on "Guilt by Association" Analysis , 2011, PloS one.

[32]  Leonard M. Freeman,et al.  A set of measures of centrality based upon betweenness , 1977 .

[33]  Giorgio Valentini,et al.  Learning node labels with multi-category Hopfield networks , 2016, Neural Computing and Applications.

[34]  Jean-Philippe Vert,et al.  A bagging SVM to learn from positive and unlabeled examples , 2010, Pattern Recognit. Lett..

[35]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[36]  Dario Malchiodi,et al.  Exploiting Negative Sample Selection for Prioritizing Candidate Disease Genes , 2017 .

[37]  Dennis Shasha,et al.  Parametric Bayesian priors and better choice of negative examples improve protein function prediction , 2013, Bioinform..

[38]  Marco Frasca,et al.  Automated gene function prediction through gene multifunctionality in biological networks , 2015, Neurocomputing.

[39]  Daniel W. A. Buchan,et al.  A large-scale evaluation of computational protein function prediction , 2013, Nature Methods.

[40]  William Stafford Noble,et al.  Learning to predict protein-protein interactions from protein sequences , 2003, Bioinform..