Machine learning for discovering missing or wrong protein function annotations

Background

A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus trained their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of the FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information.

Results

The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge in the FunCat taxonomy, Clus-Ensemble performed better when discovering new annotations, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA) was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, whereas HMC-GA performed better when detecting removed annotations. However, in this evaluation, there were less significant differences among the methods.

Conclusions

The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.
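
The discovery evaluation described above (train on the old annotations, then compare predictions against the most recent ones) can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the flat set-of-labels representation, the simple counting, and the toy protein identifiers and class codes are all assumptions made for illustration, and the label hierarchy is ignored.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: protein identifier -> set of class labels.
Annotations = Dict[str, Set[str]]


def annotation_changes(old: Annotations, new: Annotations) -> Tuple[Annotations, Annotations]:
    """Return (added, removed) labels for proteins present in both snapshots."""
    added: Annotations = {}
    removed: Annotations = {}
    for protein in old.keys() & new.keys():
        gained = new[protein] - old[protein]  # labels only in the recent snapshot
        lost = old[protein] - new[protein]    # labels only in the old snapshot
        if gained:
            added[protein] = gained
        if lost:
            removed[protein] = lost
    return added, removed


def discovery_counts(predicted: Annotations, added: Annotations, removed: Annotations) -> Tuple[int, int]:
    """Count added labels that were predicted and removed labels that were not predicted."""
    found_new = sum(len(predicted.get(p, set()) & labels) for p, labels in added.items())
    flagged_wrong = sum(len(labels - predicted.get(p, set())) for p, labels in removed.items())
    return found_new, flagged_wrong


# Toy data (made up): two proteins, old and recent annotations, and the
# predictions of a model assumed to have been trained on the old snapshot.
old = {"YAL001C": {"01", "01.01"}, "YAL002W": {"02"}}
new = {"YAL001C": {"01", "01.01", "10"}, "YAL002W": {"14"}}
predicted = {"YAL001C": {"01", "10"}, "YAL002W": {"14"}}

added, removed = annotation_changes(old, new)
found_new, flagged_wrong = discovery_counts(predicted, added, removed)
print(found_new, flagged_wrong)
```

In this sketch, a predicted label that appears only in the recent snapshot counts as a discovered annotation, and an old label that the model does not predict and that was later removed counts as a detected wrong annotation. A realistic evaluation would also need to propagate labels along the FunCat or GO hierarchy and use a thresholded or ranking-based measure.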
