Corpora for the Conceptualisation and Zoning of Scientific Papers

We present two complementary annotation schemes for sentence-based annotation of full scientific papers, CoreSC and AZ-II, which have been applied to primary research articles in chemistry. The AZ scheme is based on the rhetorical structure of a scientific paper and follows the knowledge claims made by the authors. It has been shown to be reliably annotated by independent human coders and has proven useful for various information access tasks. AZ-II is its extended version, which has been successfully applied to chemistry. The CoreSC scheme takes a different view of scientific papers, treating them as human-readable representations of scientific investigations. It therefore seeks to recover the structure of the investigation from the paper in the form of generic, high-level Core Scientific Concepts (CoreSCs). CoreSCs have been annotated by 16 chemistry experts over a total of 265 full papers in physical chemistry and biochemistry. We describe the differences and similarities between the two schemes in detail and present the two corpora produced using each scheme. The corpora share 36 papers, which allows us to compare aspects of the annotation schemes quantitatively. We show the correlation between the two schemes, discuss their strengths and weaknesses, and consider the benefits of combining a rhetoric-based analysis of the papers with a content-based one.
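
The quantitative comparison on the shared papers rests on aligning the two schemes sentence by sentence and examining how their categories co-occur. The sketch below illustrates one such alignment, assuming hypothetical per-sentence label sequences and illustrative category names; the actual category inventories and the comparison procedure used for the corpora are not detailed in this abstract.

    from collections import Counter, defaultdict

    # Hypothetical per-sentence annotations for one shared paper:
    # one CoreSC label and one AZ-II label per sentence.
    # The label names are illustrative, not the schemes' full inventories.
    coresc = ["Background", "Motivation", "Method", "Method", "Result", "Conclusion"]
    az_ii  = ["Background", "Own",        "Own",    "Own",    "Own",    "Conclusion"]

    # Sentence-level contingency counts: how often each CoreSC category
    # co-occurs with each AZ-II category across aligned sentences.
    table = Counter(zip(coresc, az_ii))

    # Row-normalise to see, for each CoreSC category, the distribution
    # of AZ-II categories it maps onto.
    rows = defaultdict(Counter)
    for (c, a), n in table.items():
        rows[c][a] += n

    for c, counts in rows.items():
        total = sum(counts.values())
        dist = {a: round(n / total, 2) for a, n in counts.items()}
        print(c, "->", dist)

Such a contingency table makes the overlap between the schemes explicit: categories that map one-to-one indicate agreement, while CoreSC categories that spread across several AZ-II zones highlight where the content-based and rhetoric-based views diverge.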
