Convolutional Neural Network for Automatic MeSH Indexing

MEDLINE is the indexed subset of the National Library of Medicine’s (NLM) journal citation database. It currently contains over 25 million biomedical citations, each indexed with a controlled vocabulary called MeSH. The number of articles indexed for MEDLINE each year has grown substantially since 1990, and since 2002 the NLM has used automatic MeSH indexing systems to help indexers with their increasing workload. This paper explores a deep learning approach to the automatic MeSH indexing problem. We present a Convolutional Neural Network (CNN) for automatic MeSH indexing and evaluate its performance by participating in the BioASQ 2019 task on large-scale online biomedical semantic indexing. The CNN model demonstrates competitive performance and outperforms the NLM’s Medical Text Indexer (MTI) by about 3%. The paper also presents a preliminary analysis comparing the results of the CNN model with those of MTI and outlines the advantages of end-to-end deep learning approaches to automatic MeSH indexing.
