A Multi-level Annotated Corpus of Scientific Papers for Scientific Document Summarization and Cross-document Relation Discovery

Related work sections, or literature reviews, are an essential part of every scientific article and are crucial for paper reviewing and assessment. The automatic generation of related work sections can be considered an instance of the multi-document summarization problem. To enable the study of this specific problem, we have developed a manually annotated, machine-readable dataset of related work sections, cited papers (i.e., references) and sentences, together with an additional layer of papers citing the references. We additionally present experiments on the identification of cited sentences, using citation contexts as input. The corpus and the gold standard are made available for use by the scientific community.
