Fast and scalable neural embedding models for biomedical sentence classification

Background: Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, many approaches have been devised for classifying sentences in biomedical publications according to their coarse semantic and rhetorical categories (e.g., Background, Methods, Results, Conclusions), with recent state-of-the-art results reported for a complex deep learning model. Recent evidence has shown that shallow and wide neural models such as fastText can provide results that are competitive with or superior to those of complex deep learning models, while requiring drastically shorter training times and scaling better. We analyze the efficacy of the fastText model in classifying biomedical sentences in the PubMed 200k RCT benchmark, and introduce a simple pre-processing step that enables the application of fastText to sentence sequences. Furthermore, we explore the utility of two unsupervised pre-training approaches in scenarios where labeled training data are limited.

Results: Our fastText-based methodology yields a state-of-the-art F1 score of 0.917 on the PubMed 200k benchmark when sentence ordering is taken into account, with a training time of only 73 s on standard hardware. Applying fastText to single sentences, without taking sentence ordering into account, yielded an F1 score of 0.852 (training time 13 s). Unsupervised pre-training of N-gram vectors greatly improved the results for small training set sizes, with an increase in F1 score of 0.21 to 0.74 when trained on only 1000 randomly picked sentences without taking sentence ordering into account.

Conclusions: Because of its ease of use and performance, fastText should be among the first choices of tools when tackling biomedical text classification problems with large corpora. Unsupervised pre-training of N-gram vectors on domain-specific corpora also makes it possible to apply fastText when labeled training data are limited.
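
The abstract does not spell out the pre-processing step, so the following is only a minimal sketch of one plausible way to expose sentence order to a bag-of-n-grams classifier: tokens from neighbouring sentences are appended to each sentence with a position prefix. The function name `add_context`, the `prev1_`/`next1_` prefixes, the window size, and the example sentences are all assumptions made for illustration, not the authors' method. The only fastText convention relied on is its default `__label__` prefix for labels in training files.

```python
from typing import List, Tuple


def add_context(sentences: List[Tuple[str, str]], window: int = 1) -> List[str]:
    """Return one fastText-style training line per sentence of an abstract.

    Hypothetical scheme: tokens of neighbouring sentences are appended with a
    position prefix, so a bag-of-n-grams model can tell them apart from the
    tokens of the sentence being classified.
    """
    lines = []
    for i, (label, text) in enumerate(sentences):
        tokens = text.lower().split()
        for offset in range(-window, window + 1):
            j = i + offset
            # Skip the sentence itself and positions outside the abstract.
            if offset == 0 or j < 0 or j >= len(sentences):
                continue
            tag = f"prev{-offset}_" if offset < 0 else f"next{offset}_"
            tokens += [tag + tok for tok in sentences[j][1].lower().split()]
        lines.append(f"__label__{label} " + " ".join(tokens))
    return lines


# Made-up example abstract: (label, sentence) pairs in document order.
abstract = [
    ("BACKGROUND", "Biomedical literature is expanding rapidly"),
    ("METHODS", "We trained a shallow classifier"),
    ("RESULTS", "The classifier reached a high F1 score"),
]
training_lines = add_context(abstract)
```

Each resulting line could be written to a text file and passed to a fastText supervised training run. Setting `window=0` would reproduce the single-sentence setting, in which sentence ordering is ignored.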
