Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs)

Background: MEDLINE®/PubMed® indexes over 20 million biomedical articles, providing curated annotation of its contents using a controlled vocabulary known as Medical Subject Headings (MeSH). The MeSH vocabulary, developed over more than 50 years, provides broad coverage of topics across biomedical research. Distilling the essential biomedical themes for a topic of interest from the relevant literature is important both for understanding the importance of related concepts and for discovering new relationships.

Results: We introduce a novel method for determining enriched curator-assigned MeSH annotations in a set of papers associated with a topic, such as a gene, an author or a disease. We generate MeSH Over-representation Profiles (MeSHOPs) to quantitatively summarize the annotations in a form convenient for further computational analysis and visualization. Based on a hypergeometric distribution of assigned terms, MeSHOPs statistically account for the prevalence of the associated biomedical annotation while highlighting unusually prevalent terms based on a specified background. MeSHOPs can be visualized using word clouds, providing a succinct quantitative graphical representation of the relative importance of terms. Using the publication dates of articles, MeSHOPs track changing patterns of annotation over time. Since MeSHOPs are quantitative vectors, they can be compared using standard techniques such as hierarchical clustering. The reliability of MeSHOP annotations is assessed based on the capacity to re-derive the subset of the Gene Ontology annotations with equivalent MeSH terms.

Conclusions: MeSHOPs allow quantitative measurement of the degree of association between any entity and the annotated medical concepts, based directly on relevant primary literature. Comparison of MeSHOPs allows entities to be related based on shared medical themes in their literature. A web interface is provided for generating and visualizing MeSHOPs.
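The hypergeometric over-representation idea described in the Results can be illustrated with a minimal sketch. It assumes a one-sided upper-tail test per MeSH term; the abstract does not specify the exact test or any multiple-testing correction, so this is an illustration rather than the authors' implementation. The function names (`hypergeom_upper_tail`, `meshop`) and the toy counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: articles in the background set
    K: background articles annotated with the term
    n: articles associated with the entity (gene, author, disease)
    k: entity articles annotated with the term
    """
    upper = min(K, n)
    if k > upper:
        return 0.0
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), upper + 1))
    return tail / total


def meshop(entity_counts, n_entity, background_counts, n_background):
    """Map each MeSH term to an over-representation p-value (smaller = more enriched)."""
    return {
        term: hypergeom_upper_tail(k, n_background, background_counts[term], n_entity)
        for term, k in entity_counts.items()
    }


# Toy data (hypothetical): a background of 1000 articles, an entity with 20 articles.
background_counts = {"Neoplasms": 50, "Humans": 500}
entity_counts = {"Neoplasms": 8, "Humans": 10}
profile = meshop(entity_counts, 20, background_counts, 1000)
```

In this toy profile, "Neoplasms" receives a much smaller p-value than "Humans": the entity's articles carry it far more often than the background rate predicts, whereas "Humans" appears at exactly the background rate. This is the sense in which a profile accounts for term prevalence while highlighting unusually frequent terms.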
