FullMeSH: improving large-scale MeSH indexing with full text

MOTIVATION With the rapidly growing biomedical literature, automatically indexing biomedical articles with Medical Subject Headings (MeSH), known as MeSH indexing, has become increasingly important for facilitating hypothesis generation and knowledge discovery. In recent years, many large-scale MeSH indexing approaches have been proposed, such as Medical Text Indexer (MTI), MeSHLabeler, DeepMeSH and MeSHProbeNet. However, the performance of these methods is limited by the information they use, i.e. only the title and abstract of each article.

RESULTS We propose FullMeSH, a large-scale MeSH indexing method that takes advantage of the recent increase in the availability of full-text articles. Compared to DeepMeSH and other state-of-the-art methods, FullMeSH has three novelties: (1) Instead of treating the full text as a whole, FullMeSH segments it into several sections with normalized titles, so that the contribution of each section to the overall performance can be distinguished. (2) FullMeSH integrates the evidence from different sections in a "learning to rank" framework by combining sparse and deep semantic representations. (3) FullMeSH trains an Attention-based Convolutional Neural Network (AttentionCNN) for each section, which achieves better performance on infrequent MeSH headings. FullMeSH was developed and trained on the entire set of 1.4 million full-text articles in the PubMed Central Open Access subset. It achieved a Micro F-measure of 66.76% on a test set of 10,000 articles, which was 3.3% and 6.4% higher than DeepMeSH and MeSHLabeler, respectively. Furthermore, FullMeSH demonstrated an average improvement of 4.7% over DeepMeSH for indexing Check Tags, a set of the most frequently indexed MeSH headings.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
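For readers unfamiliar with the Micro F-measure reported in the results, the following is a minimal sketch of the standard definition: true positives, false positives and false negatives are pooled over all articles before precision and recall are computed. The function name `micro_prf` and the toy MeSH heading sets are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]],
              predicted: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over a set of articles.

    Each element of `gold` / `predicted` is the set of MeSH headings
    assigned to one article (hypothetical input format).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # headings predicted and in the gold annotation
        fp += len(p - g)   # headings predicted but not in the gold annotation
        fn += len(g - p)   # gold headings that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy data, made up for illustration only.
gold = [{"Humans", "Female", "Neoplasms"}, {"Animals", "Mice"}]
pred = [{"Humans", "Female", "Adult"}, {"Animals", "Mice", "Rats"}]

p, r, f = micro_prf(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} micro-F={f:.4f}")
```

Because counts are pooled across articles, frequent headings such as Check Tags weigh more heavily in this score than rare headings do.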
