Interaction-based feature selection for predicting cancer-related proteins in protein-protein interaction networks

Predicting which proteins in a protein-protein interaction (PPI) network are involved in certain diseases, such as cancer, has received significant attention in the literature [1, 4]. Multiple approaches have been proposed, some based on graph algorithms, others on standard machine learning methods. Machine learning approaches such as those of Milenkovic et al. [5], Furney et al. [1], Li et al. [4], Furney et al. [2] and Kar et al. [3] typically use a feature-based representation of proteins as input, and their success depends strongly on the relevance of the selected features. Earlier work has shown that the Gene Ontology (GO) annotations of a protein are highly relevant. For instance, Li et al. [4] found that predictive performance depends only slightly on the chosen machine learning method but strongly on the chosen features, and among the many features considered, GO annotations turned out to be particularly important. In previous work, when a protein p is to be classified as disease-related or not, the GO annotations used for that prediction are usually those of p itself. In this paper, we present a new type of GO-based feature. These features are based not on the GO annotations (“functions”) of a single protein, but on pairs of functions that occur on both sides of an edge in the PPI network. We call them interaction-based features.
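To make the idea of function pairs across an edge concrete, the following is a minimal sketch and not the exact feature construction used in this paper. It assumes an undirected PPI network given as an edge list without self-loops, a mapping from each protein to its set of GO terms, and a simple count of ordered pairs (own function, neighbor function) per protein. The protein names, the function labels `f1` to `f3`, and the name `interaction_features` are hypothetical.

```python
from collections import Counter
from itertools import product


def interaction_features(edges, go_annotations):
    """Count, for each protein, the GO-term pairs seen across its PPI edges.

    Assumption: a feature is an ordered pair (f, g), where f annotates the
    protein itself and g annotates one of its interaction partners.
    """
    features = {}
    for p, q in edges:
        # The network is treated as undirected, so each edge contributes
        # to the features of both endpoints.
        for own, partner in ((p, q), (q, p)):
            counts = features.setdefault(own, Counter())
            own_terms = go_annotations.get(own, ())
            partner_terms = go_annotations.get(partner, ())
            for f, g in product(own_terms, partner_terms):
                counts[(f, g)] += 1
    return features


# Toy network with placeholder labels (not real proteins or GO identifiers).
edges = [("P1", "P2"), ("P1", "P3")]
go_annotations = {
    "P1": {"f1"},
    "P2": {"f2", "f3"},
    "P3": {"f2"},
}

features = interaction_features(edges, go_annotations)
for protein in sorted(features):
    print(protein, sorted(features[protein].items()))
```

In this toy network, P1 has function f1 and interacts with two proteins that both have function f2, so the pair (f1, f2) is counted twice for P1. Whether such pairs are better encoded as counts, as binary indicators, or as unordered pairs is a design choice that this sketch leaves open.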