MeSH Up: effective MeSH text classification for improved document retrieval

MOTIVATION: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent in free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems.

RESULTS: We compare the performance of six MeSH classification systems [MetaMap, EAGL, a language model-based approach, a vector space model-based approach, a K-Nearest Neighbor (KNN) approach and MTI] in terms of reproducing and complementing manual MeSH annotations. The KNN system clearly outperforms the other published approaches and scales well to large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement in information retrieval (IR) can be obtained when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone.

CONCLUSIONS: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable, and it generates improvements in IR comparable with those observed for manual annotations.
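To make the general idea of KNN-based MeSH assignment concrete, the following is a minimal sketch and not the system evaluated in the paper. It assumes a tiny hand-made set of already annotated documents, cosine similarity over raw term counts, and similarity-weighted voting; the function names (`tokenize`, `cosine`, `knn_mesh`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, alphabetic tokens only.
    return [t for t in text.lower().split() if t.isalpha()]


def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counter objects).
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_mesh(query, annotated_docs, k=2):
    # Rank MeSH terms for `query` by summing the similarities of the
    # k most similar annotated documents that carry each term.
    q = Counter(tokenize(query))
    neighbours = sorted(
        ((cosine(q, Counter(tokenize(text))), terms) for text, terms in annotated_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, terms in neighbours:
        for term in terms:
            votes[term] += sim
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy annotated collection (hypothetical, not taken from the paper).
DOCS = [
    ("insulin therapy in patients with diabetes", ["Diabetes Mellitus", "Insulin"]),
    ("gene expression in breast cancer cells", ["Breast Neoplasms", "Gene Expression"]),
    ("diabetes and blood glucose control", ["Diabetes Mellitus", "Blood Glucose"]),
]

if __name__ == "__main__":
    for term, score in knn_mesh("insulin and glucose in diabetes", DOCS, k=2):
        print(f"{term}\t{score:.3f}")
```

In an IR setting such as the one described above, the top-ranked terms could be appended to the user's textual query before retrieval; how the terms are weighted against the original query text is a design choice this sketch leaves open.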
