A High Recall Classifier for Selecting Articles for MEDLINE Indexing

MEDLINE is the National Library of Medicine's premier bibliographic database for biomedical literature. A highly valuable feature of the database is that each record is manually indexed with a controlled vocabulary called MeSH (Medical Subject Headings). Most MEDLINE journals are indexed cover-to-cover, but about 200 journals are selectively indexed: only their articles related to biomedicine and the life sciences are indexed. In recent years, the selection process has become an increasing burden for indexing staff. This paper presents a machine learning system that semi-automates the task and offers substantial time savings. At the core of the system is a high-recall classifier that identifies journal articles that are in scope for MEDLINE. The system is shown to reduce the number of articles requiring manual review by 54%, equivalent to approximately 40,000 articles per year.
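To make the idea of a high-recall operating point concrete, the following is a minimal sketch and not the system described in this paper. It assumes a classifier has already assigned each article a score for being in scope, picks the highest score threshold that still meets a target recall on labeled examples, and reports the fraction of articles that would fall below that threshold and so skip manual review. The function names, the toy scores, and the target recall values are hypothetical and chosen only for illustration.

```python
import math


def pick_threshold(scores, labels, target_recall):
    """Return the highest threshold whose recall on in-scope articles
    (label 1) is at least target_recall. Hypothetical helper."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one in-scope example")
    # Number of in-scope articles that must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def review_reduction(scores, threshold):
    """Fraction of articles scoring below the threshold, i.e. the share
    that would no longer be sent for manual review."""
    return sum(s < threshold for s in scores) / len(scores)


# Toy data (made up): classifier scores and gold labels (1 = in scope).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

threshold = pick_threshold(scores, labels, target_recall=1.0)
print(threshold, review_reduction(scores, threshold))
```

Lowering the target recall raises the threshold and removes more articles from manual review, at the cost of missing some in-scope articles; a high-recall setting accepts a smaller workload reduction to keep such misses rare.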
