Random Perturbations of Term Weighted Gene Ontology Annotations for Discovering Gene Unknown Functionalities

Computational analyses for biomedical knowledge discovery benefit greatly from descriptions of gene and protein functional features expressed through controlled terminologies and ontologies, that is, from their controlled annotations. In recent years, several databases of such annotations have become available; yet these annotations are incomplete, and only some of them represent highly reliable, human-curated information. To predict and discover unknown or missing annotations, existing approaches use unsupervised learning algorithms. We propose a new learning method that allows supervised algorithms to be applied to unsupervised problems, yielding more accurate annotation predictions. The method, which we extend here from our preceding work by adding data weighting techniques, is based on generating artificial labeled training sets through random perturbations of the original data. We tested it on nine Gene Ontology annotation datasets; the results show that our approach is effective at predicting novel annotations and outperforms state-of-the-art unsupervised methods.
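
The idea of building an artificial labeled training set by random perturbation can be illustrated with a minimal sketch. The abstract does not specify the perturbation procedure, so everything below is an assumption made for illustration: annotations are represented as a binary gene-by-term matrix, a perturbation removes each known annotation with a fixed probability, and the function names (`perturb_annotations`, `build_training_set`) and the `drop_prob` parameter are hypothetical. Term weighting, which the abstract also mentions, is not shown.

```python
import random


def perturb_annotations(matrix, drop_prob, seed=0):
    """Return a copy of a binary gene-by-term matrix in which each known
    annotation (a 1) is removed (set to 0) with probability drop_prob.

    Assumption: this removal-only scheme is illustrative, not necessarily
    the perturbation used in the paper.
    """
    rng = random.Random(seed)
    perturbed = []
    for row in matrix:
        new_row = [0 if (value == 1 and rng.random() < drop_prob) else value
                   for value in row]
        perturbed.append(new_row)
    return perturbed


def build_training_set(matrix, drop_prob, seed=0):
    """Pair a perturbed matrix (features) with the original matrix (labels).

    A supervised learner could then be trained to recover the original
    annotations from the perturbed, less complete version.
    """
    features = perturb_annotations(matrix, drop_prob, seed)
    labels = [list(row) for row in matrix]
    return features, labels


# Toy example: 3 genes (rows) by 4 ontology terms (columns).
annotations = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
features, labels = build_training_set(annotations, drop_prob=0.3, seed=42)
```

Under these assumptions, a model trained to map `features` to `labels` could afterwards be applied to the current, incomplete annotation matrix, where the annotations it predicts but that are not yet recorded would be the candidate novel annotations.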
