Anticipating annotations and emerging trends in biomedical literature

The BioJournalMonitor is a decision support system for the analysis of trends and topics in the biomedical literature. Its main goal is to identify potential diagnostic and therapeutic biomarkers for specific diseases. Several data sources are continuously integrated to provide the user with up-to-date information on current research in this field. State-of-the-art text mining technologies add value on top of the original content, including named entity detection, relation extraction, classification, clustering, ranking, summarization, and visualization. We present two novel technologies for analyzing the temporal dynamics of text archives and their associated ontologies. Currently, the MeSH ontology is used to annotate the scientific articles entering the PubMed database with medical terms. Both the maintenance of the ontology and the annotation of new articles are performed largely manually. We describe how probabilistic topic models can be used to annotate recent articles with the most likely MeSH terms. This gives our users a competitive advantage because searches for MeSH terms return articles long before they are manually annotated. We further present a study on how to predict the inclusion of new terms in the MeSH ontology. The results suggest that early prediction of emerging trends is possible. The trend ranking functions are deployed in our system to enable interactive searches for the hottest new trends relating to a disease.
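To make the annotation idea concrete, here is a minimal sketch of how a topic model could score candidate MeSH terms for an unannotated article. It assumes a model that has already been trained and that exposes a per-document topic mixture p(topic | doc) and a per-topic term distribution p(term | topic). The function name `rank_mesh_terms`, the topic labels, and all probabilities are hypothetical toy values, not taken from the system described above.

```python
def rank_mesh_terms(doc_topic, topic_term, top_k=3):
    """Rank candidate terms by sum over topics of p(term | topic) * p(topic | doc)."""
    scores = {}
    for topic, p_topic in doc_topic.items():
        for term, p_term in topic_term[topic].items():
            scores[term] = scores.get(term, 0.0) + p_topic * p_term
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy inputs (assumed for illustration only).
doc_topic = {"oncology": 0.7, "genetics": 0.3}
topic_term = {
    "oncology": {"Neoplasms": 0.6, "Biomarkers, Tumor": 0.4},
    "genetics": {"Mutation": 0.5, "Biomarkers, Tumor": 0.5},
}

ranked = rank_mesh_terms(doc_topic, topic_term)
for term, score in ranked:
    print(f"{term}: {score:.2f}")
```

In this toy case a term supported by both topics can outrank a term that is dominant in only one, which is the kind of behavior that makes a mixture-based score useful for suggesting annotations before manual indexing is available.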
