Ontology-Based Prediction and Prioritization of Gene Functional Annotations

Genes and their protein products are essential molecular units of a living organism. Knowledge of their functions is key to understanding physiological and pathological biological processes, as well as to developing new drugs and therapies. The association of a gene or protein with its functions, described by controlled terms of biomolecular terminologies or ontologies, is called a gene functional annotation. Many valuable gene annotations expressed through terminologies and ontologies are available. Nevertheless, they may include erroneous information, since only a subset of annotations is reviewed by curators. Furthermore, they are inherently incomplete, given the rapid pace at which biomolecular knowledge evolves. In this scenario, computational methods that can speed up the annotation curation process and reliably suggest new annotations are very important. Here, we first propose a computational pipeline that uses different semantic and machine learning methods to predict novel ontology-based gene functional annotations; then, we introduce a new semantic prioritization rule to categorize the predicted annotations by their likelihood of being correct. Our tests and validations showed the effectiveness of our pipeline and of the prioritization of predicted annotations: many of the predicted annotations selected as most likely were later confirmed.
