Latent Dirichlet Allocation based on Gibbs Sampling for gene function prediction

Gene function annotations are key elements in biology and bioinformatics. A typical annotation associates a gene with a controlled vocabulary term that describes one of its functional features (e.g. a Gene Ontology (GO) term). Unfortunately, available annotations contain errors, and biologically validated ones are incomplete by definition, since new knowledge is continuously discovered. Computational algorithms that provide ranked lists of predicted new gene annotations are therefore a valuable contribution to bioinformatics research. Here, we propose two variants of the well-known Latent Dirichlet Allocation (LDA) algorithm applied to the prediction of gene annotations. LDA is an efficient machine learning method built on a set of multinomial probability distributions over topics, one for each document (a gene, in our case), and on a set of multinomial probability distributions over words (feature terms, in our case), one for each topic. In topic modeling, a topic can be considered a latent meta-category of words, and a document a mixture of topics. Both LDA variants use the collapsed Gibbs sampling method during the training phase; they differ in the initialization approach used to adapt the LDA mathematical model to the biomolecular annotation scenario. Using six outdated datasets of GO annotations of human and brown rat genes, we compared the annotations predicted by our methods with those predicted by the previously developed truncated Singular Value Decomposition (tSVD) method; we then validated them against the annotations available in an updated version of the same datasets. The results show the effectiveness of our proposed algorithms.
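As a rough illustration of the gene-as-document, term-as-word mapping described above, the following is a minimal sketch of standard collapsed Gibbs sampling for LDA applied to a toy annotation list. It is not the implementation evaluated here: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, the toy data, and the uniform random initialization are all assumptions made for illustration, and neither of the two initialization approaches proposed in this work is reproduced.

```python
import numpy as np

def lda_gibbs(annotations, n_topics, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    annotations[g] is the list of feature-term indices annotated to gene g.
    Returns a (genes x terms) matrix of scores P(term | gene).
    """
    rng = np.random.default_rng(seed)
    n_genes = len(annotations)
    n_terms = 1 + max(t for terms in annotations for t in terms)

    gene_topic = np.zeros((n_genes, n_topics))   # topic counts per gene
    topic_term = np.zeros((n_topics, n_terms))   # term counts per topic
    topic_total = np.zeros(n_topics)             # total terms per topic

    # Assumption: uniform random initial topic for every annotation.
    assign = [rng.integers(n_topics, size=len(terms)) for terms in annotations]
    for g, terms in enumerate(annotations):
        for t, k in zip(terms, assign[g]):
            gene_topic[g, k] += 1
            topic_term[k, t] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for g, terms in enumerate(annotations):
            for i, t in enumerate(terms):
                # Remove the current assignment from the counts.
                k = assign[g][i]
                gene_topic[g, k] -= 1
                topic_term[k, t] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this annotation.
                p = (gene_topic[g] + alpha) * (topic_term[:, t] + beta) \
                    / (topic_total + n_terms * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment.
                assign[g][i] = k
                gene_topic[g, k] += 1
                topic_term[k, t] += 1
                topic_total[k] += 1

    # Point estimates of the two sets of multinomial distributions.
    theta = (gene_topic + alpha) / (gene_topic.sum(axis=1, keepdims=True)
                                    + n_topics * alpha)        # topics given gene
    phi = (topic_term + beta) / (topic_total[:, None] + n_terms * beta)  # terms given topic
    return theta @ phi

# Hypothetical toy data: 4 genes annotated to 6 feature terms.
toy = [[0, 1, 2], [0, 1], [3, 4], [3, 4, 5]]
scores = lda_gibbs(toy, n_topics=2, n_iters=50)
```

Each row of `scores` is a probability distribution over feature terms for one gene. Sorting the terms that a gene is not yet annotated to by their score would give a ranked list of candidate annotations, in the spirit of the predictions discussed above.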
