Exploiting PubMed for Protein Molecular Function Prediction via NMF Based Multi-label Classification

Gene Ontology (GO) defines the terms and classes used to describe gene functions and the relationships between them. GO has become the standard for describing the functions of specific genes in different model organisms. GO annotation, which tags genes with GO terms, has mostly been a manual and time-consuming curation process. In this paper we describe the development and evaluation of a predictive system that automatically assigns a gene its molecular functions (GO terms) using biomedical literature as a resource. We treated GO term assignment as a multi-label, multi-class classification problem. Rather than relying on the commonly used bag-of-words approach, we used non-negative matrix factorization (NMF) for feature reduction and then classified the genes in the reduced feature space. To address the multi-label aspect of the data, we used the binary-relevance method, which trains one binary classifier per GO term. We experimented with different classifiers and found that the combination of binary relevance and a K-nearest neighbor (KNN) classifier gave the best performance. Our evaluation on the UniProtKB/Swiss-Prot dataset showed a best F-measure of 0.83.
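The pipeline described above (NMF feature reduction followed by binary-relevance KNN) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy term-count matrix, the label assignments, the rank of 2, k = 3, the multiplicative-update variant of NMF, and the helper names `nmf` and `knn_binary_relevance` are all assumptions made for illustration. For brevity the sketch factorizes all genes together and then holds one out for prediction, which a real evaluation would avoid.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative V (n x m) as W (n x rank) times H (rank x m)
    using multiplicative updates for the squared-error objective."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H


def knn_binary_relevance(train_X, train_Y, query, labels, k=3):
    """Binary relevance: one independent yes/no KNN vote per label."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    neighbours = order[:k]
    predicted = set()
    for label in labels:
        votes = sum(1 for i in neighbours if label in train_Y[i])
        if 2 * votes > k:  # strict majority of the k neighbours
            predicted.add(label)
    return predicted


# Toy data (invented): rows are genes, columns are term counts from abstracts.
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 3, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [0, 0, 4, 3],
]
# Toy GO-term sets per gene (assignments are illustrative only).
Y = [
    {"GO:0003677"},
    {"GO:0003677"},
    {"GO:0003677", "GO:0005524"},
    {"GO:0016301"},
    {"GO:0016301", "GO:0005524"},
    {"GO:0016301"},
]
LABELS = sorted(set().union(*Y))

W, H = nmf(V, rank=2)  # each row of W is a gene's reduced feature vector
held_out = 0
train = [i for i in range(len(V)) if i != held_out]
pred = knn_binary_relevance([W[i] for i in train], [Y[i] for i in train],
                            W[held_out], LABELS, k=3)
print(sorted(pred))
```

Because each GO term receives its own vote, a gene can be assigned zero, one, or several terms, which is what makes the binary-relevance scheme suitable for multi-label data.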
