MEDLINE MeSH indexing: lessons learned from machine learning and future directions

Because MEDLINE grows substantially every year, Medical Subject Headings (MeSH) indexing is becoming increasingly difficult for the relatively small group of highly qualified indexing staff at the US National Library of Medicine (NLM). The Medical Text Indexer (MTI) is a support tool that assists indexers; it relies on MetaMap and a k-NN approach called PubMed Related Citations (PRC). Our motivation is to improve the quality of MTI using machine learning. Typical machine learning approaches treat this indexing task as text categorization. In this work, we study some of the MeSH headings recommended by MTI and analyze the issues that arise when standard machine learning algorithms are applied to them. We show that in some cases machine learning can improve the annotations already recommended by MTI, that machine learning based on low-variance methods achieves better performance, and that each MeSH heading behaves differently. In addition, several factors make this task difficult (e.g. limited access to the full text of the citations), and these provide direction for future work.
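To make the text-categorization framing concrete, the following is a minimal sketch of a nearest-neighbour heading recommender. It is not MTI and not the PRC algorithm: the bag-of-words cosine similarity, the function names, the `k` and `min_votes` parameters, and the three toy citations are all assumptions made for illustration only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-tokenised term counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend_headings(text, indexed, k=2, min_votes=2):
    """Recommend headings that appear on at least `min_votes` of the k most similar citations."""
    query = bag_of_words(text)
    ranked = sorted(indexed, key=lambda c: cosine(query, bag_of_words(c[0])), reverse=True)
    votes = Counter(h for _, headings in ranked[:k] for h in headings)
    return sorted(h for h, n in votes.items() if n >= min_votes)


# Toy, invented citations with hand-assigned headings (illustration only).
indexed = [
    ("aspirin reduces risk of myocardial infarction in adults",
     {"Aspirin", "Myocardial Infarction", "Humans"}),
    ("low dose aspirin therapy after myocardial infarction",
     {"Aspirin", "Myocardial Infarction", "Humans"}),
    ("gene expression in mouse liver cells",
     {"Gene Expression", "Mice", "Liver"}),
]

print(recommend_headings("aspirin and myocardial infarction outcomes", indexed))
```

A learned alternative would train one binary classifier per heading on the same kind of features, which is the setting in which per-heading differences in behavior become visible.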
