Topic modeling of biomedical text

The massive growth of biomedical text makes it very challenging for researchers to review all relevant work and generate all possible hypotheses in a reasonable amount of time. Many text mining methods have been developed to simplify this process and quickly present the researcher with a learned set of biomedical hypotheses that could potentially be validated. Previously, we focused on the task of identifying genes that are linked with a given disease by text mining PubMed abstracts. We applied a word-based concept profile similarity to learn patterns between disease and gene entities and hence identify links between them. In this work, we study an alternative approach based on topic modeling to learn different patterns between the disease and gene entities, and we measure how this affects the identified links. We investigated multiple input corpora, word representations, topic parameters, and similarity measures. On the one hand, our results show that when we (1) learn the topics from a gene-clustered set of abstracts and (2) apply the dot-product similarity measure, we improve on our original methods and identify more correct disease-gene links. On the other hand, the results also show that the learned topics remain limited to the diseases existing in our vocabulary, so that scaling the methodology to new disease queries becomes nontrivial.
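To make the two ingredients named in the abstract concrete (topics learned from gene-clustered abstracts, then dot-product similarity between topic distributions), the following is a minimal sketch and not the authors' implementation. It assumes a plain latent Dirichlet allocation model fitted by collapsed Gibbs sampling, one pseudo-document per gene and per disease, and hypothetical toy tokens; the function names, the hyperparameters `alpha` and `beta`, and the number of topics are all illustrative choices.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic distribution per document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]      # topic counts per document
    topic_word = [dict() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                    # total tokens per topic
    assign = []                                     # topic assignment of every token

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] = topic_word[k].get(w, 0) + delta
        topic_total[k] += delta

    # Random initial assignment of each token to a topic.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, w, k, +1)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, assign[d][i], +1)

    # Smoothed estimate of each document's topic proportions.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


def dot(u, v):
    """Dot-product similarity between two topic distributions."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy data: tokens of all abstracts mentioning an entity, concatenated.
gene_docs = {
    "BRCA1": "breast tumor dna repair mutation breast ovarian tumor repair".split(),
    "INS": "insulin glucose pancreas secretion glucose insulin blood".split(),
    "TP53": "tumor apoptosis dna mutation tumor suppressor repair".split(),
}
disease_docs = {
    "breast cancer": "breast tumor mutation ovarian dna tumor breast".split(),
    "diabetes": "glucose insulin blood pancreas insulin glucose".split(),
}

all_docs = {**gene_docs, **disease_docs}
names = list(all_docs)
theta = dict(zip(names, fit_lda([all_docs[n] for n in names], n_topics=2)))


def rank_genes(disease):
    """Rank every gene by the dot product of its topic distribution with the disease's."""
    scores = [(gene, dot(theta[disease], theta[gene])) for gene in gene_docs]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_genes("breast cancer")
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

Because each disease is scored only through a topic distribution learned for its own pseudo-document, this sketch shares the limitation noted in the abstract: a disease absent from the training vocabulary has no topic distribution to compare against without further inference.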
