Sequential Span Classification with Neural Semi-Markov CRFs for Biomedical Abstracts

Dividing biomedical abstracts into segments with distinct rhetorical roles is essential for supporting researchers’ information access in the biomedical domain. Conventional methods treat the task as sequence labeling based on sequential sentence classification, i.e., they assign a rhetorical label to each sentence by considering its context in the abstract. However, these methods have a critical problem: they are prone to mislabeling sentences within longer runs of consecutive sentences that share the same rhetorical label. To tackle this problem, we propose sequential span classification, which assigns a rhetorical label not to a single sentence but to a span of consecutive sentences. Accordingly, we introduce Neural Semi-Markov Conditional Random Fields, which assign labels to such spans by considering all possible spans of various lengths. Experimental results on the PubMed 20k RCT and NICTA-PIBOSO datasets demonstrate that the proposed method achieves the best micro sentence-F1 score as well as the best micro span-F1 score.
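To make the idea of scoring all possible spans concrete, the following is a minimal sketch of Viterbi-style decoding for a semi-Markov model over sentences. It is not the authors' implementation: the abstract does not describe the inference procedure or the neural span scorer, so the function names (`semi_markov_viterbi`, `span_score`, `trans_score`), the `max_len` cap, and the toy scores are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def semi_markov_viterbi(
    n: int,
    labels: List[str],
    span_score: Callable[[int, int, str], float],
    trans_score: Callable[[str, str], float],
    max_len: int,
) -> List[Span]:
    """Return the highest-scoring segmentation of n sentences into labeled spans.

    Hypothetical interface: span_score(i, j, y) scores sentences i..j-1 as one
    span with label y; trans_score(y_prev, y) scores two adjacent spans.
    """
    neg_inf = float("-inf")
    # best[j][y]: best score of any segmentation of sentences 0..j-1
    # whose last span carries label y.
    best: List[Dict[str, float]] = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back: List[Dict[str, Optional[Tuple[int, Optional[str]]]]] = [
        {y: None for y in labels} for _ in range(n + 1)
    ]

    for j in range(1, n + 1):
        # Consider every span (i, j) ending at j, up to max_len sentences long.
        for i in range(max(0, j - max_len), j):
            for y in labels:
                s = span_score(i, j, y)
                if i == 0:
                    # First span of the abstract: no transition term.
                    if s > best[j][y]:
                        best[j][y], back[j][y] = s, (0, None)
                    continue
                for y_prev in labels:
                    if best[i][y_prev] == neg_inf:
                        continue
                    cand = best[i][y_prev] + trans_score(y_prev, y) + s
                    if cand > best[j][y]:
                        best[j][y], back[j][y] = cand, (i, y_prev)

    if n == 0:
        return []

    # Backtrack from the best label for the final span.
    y: Optional[str] = max(labels, key=lambda lab: best[n][lab])
    j = n
    spans: List[Span] = []
    while j > 0:
        i, y_prev = back[j][y]
        spans.append((i, j, y))
        j, y = i, y_prev
    spans.reverse()
    return spans


# Toy per-sentence scores (made up). Sentence 2 slightly prefers RESULTS,
# although it sits inside a run of METHODS sentences.
LABELS = ["BACKGROUND", "METHODS", "RESULTS"]
SENT_SCORES = [
    {"BACKGROUND": 2.0},
    {"METHODS": 2.0},
    {"METHODS": 0.4, "RESULTS": 0.5},
    {"METHODS": 2.0},
    {"RESULTS": 2.0},
]


def toy_span_score(i: int, j: int, y: str) -> float:
    # Stand-in for a neural span scorer: sum of per-sentence scores.
    return sum(SENT_SCORES[k].get(y, 0.0) for k in range(i, j))


def toy_trans_score(y_prev: str, y: str) -> float:
    # Each new span costs 0.5; adjacent spans with the same label are
    # strongly discouraged so that spans stay maximal.
    return -5.0 if y_prev == y else -0.5


spans = semi_markov_viterbi(len(SENT_SCORES), LABELS, toy_span_score, toy_trans_score, max_len=4)
print(spans)
```

In this toy setting, labeling each sentence independently by its highest score would give sentence 2 the label RESULTS and split the METHODS run. Because the decoder scores whole spans and pays a cost for each additional span, it instead keeps sentences 1 to 3 together as a single METHODS span.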
