Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model

Background: Large-scale genomic studies often identify large gene lists, for example, genes that share the same expression pattern. These gene lists are generally interpreted by extracting the concepts that are overrepresented in them. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular Gene Ontology (GO). However, annotating genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered.

Results: We propose a statistical method that uses the primary literature, i.e., free text, as the source for overrepresentation analysis. The method is based on a statistical mixture-model framework and addresses methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment, and added features that facilitate the interactive analysis of gene sets. Through experiments with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results.

Conclusions: We conclude that the current work provides biologists with a tool that effectively complements existing tools for overrepresentation analysis of genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp
