Scientific Discourse Tagging for Evidence Extraction

Evidence plays a crucial role in any biomedical research narrative, providing justification for some claims and refutation for others. We seek to build models of scientific argument by applying information extraction methods to full-text papers. We present the capability of automatically extracting text fragments from primary research papers that describe the evidence presented in that paper's figures, which arguably provides the raw material of any scientific argument made within the paper. We apply richly contextualized deep representation learning, pre-trained on a biomedical domain corpus, to the analysis of scientific discourse structures and the extraction of "evidence fragments" (i.e., the text in the results section describing data presented in a specified subfigure) from a set of biomedical experimental research articles. We first demonstrate that our scientific discourse tagger achieves state-of-the-art performance on two scientific discourse tagging datasets and that it transfers to new datasets. We then show the benefit of leveraging scientific discourse tags for downstream tasks such as claim extraction and evidence fragment detection. Our work demonstrates the potential of using evidence fragments derived from figure spans for improving the quality of scientific claims by cataloging, indexing, and reusing evidence fragments as independent documents.
