Comparison and combination of several MeSH indexing approaches

MeSH indexing of MEDLINE is becoming more difficult for the highly qualified indexing staff at the US National Library of Medicine, owing to the large yearly growth of MEDLINE and the increasing size of MeSH. Since 2002, this task has been assisted by the Medical Text Indexer (MTI) program. We extend previous machine learning analysis by adding a more diverse set of MeSH headings, targeting examples where MTI has been shown to perform poorly. Machine learning algorithms exceed MTI's performance on MeSH headings that are used very frequently and on headings for which the indexing frequency is very low. When we combine the MTI suggestions with the predictions of the learning algorithms, performance improves over any single method for most of the evaluated MeSH headings.
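The abstract does not say how the MTI suggestions and the learned predictions are combined, so the following is only a minimal sketch under stated assumptions. It supposes that, for a single MeSH heading, each method returns the set of citation IDs it would index with that heading, and that the candidate combinations are set union and set intersection scored by F1 against the human indexing. The function names and the toy citation IDs are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple


def f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 of a predicted set of citation IDs against the gold (human) indexing."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def candidate_predictions(mti: Set[str], ml: Set[str]) -> Dict[str, Set[str]]:
    """Single methods plus two simple combinations (assumed, not from the paper)."""
    return {
        "mti": mti,
        "ml": ml,
        "union": mti | ml,          # favors recall
        "intersection": mti & ml,   # favors precision
    }


def best_method(
    mti: Set[str], ml: Set[str], gold: Set[str]
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate on held-out data and return the best one by F1."""
    scores = {
        name: f1(pred, gold)
        for name, pred in candidate_predictions(mti, ml).items()
    }
    return max(scores, key=scores.get), scores


# Toy data for one MeSH heading (hypothetical citation IDs).
mti_suggestions = {"d1", "d2", "d3"}
ml_predictions = {"d3", "d4"}
human_indexing = {"d1", "d3", "d4"}

winner, all_scores = best_method(mti_suggestions, ml_predictions, human_indexing)
print(winner, all_scores)
```

In this toy case the union scores highest because each method recovers citations the other misses. Selecting the method per heading on held-out data, rather than fixing one globally, is one way to accommodate a result that holds for most but not all headings.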
