Discovering New Gene Functionalities from Random Perturbations of Known Gene Ontological Annotations

Genomic annotations describing functional features of genes and proteins through controlled terminologies and ontologies are extremely valuable, especially for computational analyses aimed at inferring new biomedical knowledge. Thanks to the revolution in biology brought about by novel DNA sequencing technologies, several repositories of such annotations have become available in the last decade; among them, those that include Gene Ontology annotations are the most relevant. Nevertheless, the available set of genomic annotations is incomplete, and only some of the available annotations represent highly reliable, human-curated information. In this paper we propose a novel representation of the annotation discovery problem that enables the application of supervised algorithms to predict Gene Ontology annotations for the genes of different organisms. To use supervised algorithms even though labeled data for training the prediction model are not available, we propose a random perturbation method for the training set, which creates a new annotation matrix that is used to train the model to recognize new annotations. We tested the effectiveness of our approach on nine Gene Ontology annotation datasets. The results show that our technique improves novel annotation predictions with respect to state-of-the-art unsupervised methods.
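The abstract does not specify how the random perturbation is carried out, so the following is only a minimal sketch of one plausible reading: a random fraction of the known annotations in a binary gene-by-term matrix is switched off, and the original matrix supplies the labels, so that a supervised model learns to recover annotations that are missing from its input. The function names, the `drop_fraction` parameter, and the choice to perturb only by removal are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def perturb_annotations(A, drop_fraction=0.1, seed=0):
    """Return a copy of the binary gene-by-term matrix A in which a random
    fraction of the 1-entries (known annotations) is set to 0.

    Assumption: perturbation is done by removal only; the paper may differ.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=np.int8)
    ones = np.argwhere(A == 1)                      # (row, col) of each annotation
    n_drop = int(round(drop_fraction * len(ones)))  # how many annotations to hide
    chosen = ones[rng.choice(len(ones), size=n_drop, replace=False)]
    perturbed = A.copy()
    perturbed[chosen[:, 0], chosen[:, 1]] = 0
    return perturbed


def build_training_pairs(A, term, drop_fraction=0.1, seed=0):
    """Build (features, labels) for one term (hypothetical helper).

    Features: each gene's perturbed annotation profile.
    Labels:   the gene's annotation to `term` in the unperturbed matrix.
    """
    A = np.asarray(A, dtype=np.int8)
    X = perturb_annotations(A, drop_fraction, seed)
    y = A[:, term]
    return X, y
```

Under these assumptions, `X, y = build_training_pairs(A, term=0)` yields a design matrix and a label vector that any off-the-shelf binary classifier could be trained on, one term at a time.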
