Positive and Unlabeled Learning through Negative Selection and Imbalance-aware Classification

Motivated by applications in protein function prediction, we consider a challenging supervised classification setting in which positive labels are scarce and there are no explicit negative labels. The learning algorithm must therefore select which unlabeled examples to use as negative training points, which can leave it with a class-imbalanced learning problem. We address both issues by proposing an algorithm that combines active learning (for selecting negative examples) with imbalance-aware learning (for mitigating the label imbalance). In our experiments, these two techniques operate synergistically, and their combination outperforms state-of-the-art methods on standard protein function prediction benchmarks.
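
To make the two ingredients concrete, the following is a minimal illustrative sketch, not the algorithm evaluated in this work. It assumes a class-weighted logistic regression as the imbalance-aware learner and an uncertainty-based rule (pick the unlabeled points closest to the current decision boundary) as the active-learning criterion for choosing negatives. All function names, parameters, and default values are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    return sigmoid(dot(w, x) + b)


def train_weighted_logreg(X, y, epochs=200, lr=0.5):
    """Imbalance-aware learner (assumption): logistic regression whose
    per-class weights are inversely proportional to class frequency."""
    n, n_pos = len(y), sum(y)
    w_pos, w_neg = n / (2.0 * n_pos), n / (2.0 * (n - n_pos))
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            c = (w_pos if yi == 1 else w_neg) * (predict_proba(w, b, xi) - yi)
            for j in range(dim):
                grad_w[j] += c * xi[j]
            grad_b += c
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def select_negatives(X, pos_idx, unl_idx, n_rounds=3, batch=4, seed=0):
    """Negative selection (assumption): start from a random batch of
    unlabeled points, then repeatedly add the unlabeled points that lie
    closest to the current decision boundary, treating them as negatives."""
    rng = random.Random(seed)
    negatives = rng.sample(unl_idx, batch)
    chosen = set(negatives)
    pool = [i for i in unl_idx if i not in chosen]

    def fit(negs):
        idx = list(pos_idx) + list(negs)
        y = [1] * len(pos_idx) + [0] * len(negs)
        return train_weighted_logreg([X[i] for i in idx], y)

    for _ in range(n_rounds):
        w, b = fit(negatives)
        # Smallest |w.x + b| means most uncertain under the current model.
        pool.sort(key=lambda i: abs(dot(w, X[i]) + b))
        negatives.extend(pool[:batch])
        pool = pool[batch:]
    return negatives, fit(negatives)
```

Calling `select_negatives(X, pos_idx, unl_idx)` returns the indices chosen as negatives together with the weights and bias of a final model trained on the positives and those negatives. In this sketch, any unlabeled point that is in fact positive but lies near the boundary would be selected as a negative, so a practical method needs a selection criterion that guards against this.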
