NLPContributions: An Annotation Scheme for Machine Reading of Scholarly Contributions in Natural Language Processing Literature

We describe an annotation initiative to capture the scholarly contributions in natural language processing (NLP) articles, particularly articles that discuss machine learning (ML) approaches for various information extraction tasks. We develop the annotation task based on a pilot annotation exercise on 50 NLP-ML scholarly articles presenting contributions to five information extraction tasks: 1) machine translation, 2) named entity recognition, 3) question answering, 4) relation classification, and 5) text classification. In this article, we describe the outcomes of this pilot annotation phase. Through the exercise, we obtained an annotation methodology and identified ten core information units that reflect the contributions of NLP-ML scholarly investigations. The annotation scheme we developed from these information units is called NLPContributions. The overarching goal of our endeavor is four-fold: 1) to find a systematic set of patterns of subject-predicate-object statements for the semantic structuring of scholarly contributions that are more or less generically applicable to NLP-ML research articles; 2) to apply the discovered patterns in the creation of a larger annotated dataset for training machine readers of research contributions; 3) to ingest the dataset into the Open Research Knowledge Graph (ORKG) infrastructure as a showcase for creating user-friendly state-of-the-art overviews; and 4) to integrate the machine readers into the ORKG to assist users in the manual curation of their respective article contributions. We envision that the NLPContributions methodology will engender a wider discussion on the topic toward its further refinement and development. Our pilot dataset of 50 NLP-ML scholarly articles, annotated according to the NLPContributions scheme, is openly available to the research community at this https URL.
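To make the idea of subject-predicate-object statements concrete, the following is a minimal Python sketch of how one article's contribution might be held as a list of triples and grouped by subject. It is an illustration only: the `Triple` type, the `group_by_subject` helper, and the example subjects, predicates, and objects are all hypothetical and are not taken from the NLPContributions scheme or the ORKG data model.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical representation)."""
    subject: str
    predicate: str
    obj: str


def group_by_subject(triples):
    """Collect (predicate, object) pairs under each subject, preserving order."""
    grouped = defaultdict(list)
    for t in triples:
        grouped[t.subject].append((t.predicate, t.obj))
    return dict(grouped)


# Hypothetical statements for a single article; the labels are invented
# for illustration and are not the scheme's actual information units.
triples = [
    Triple("Contribution", "has research problem", "relation classification"),
    Triple("Contribution", "has model", "Model"),
    Triple("Model", "uses", "bidirectional LSTM"),
]

grouped = group_by_subject(triples)
for subject, statements in grouped.items():
    for predicate, obj in statements:
        print(f"{subject} -- {predicate} --> {obj}")
```

Because an object such as "Model" can itself appear as the subject of further statements, a flat list of triples is enough to express a nested description of a contribution.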
