The ART Corpus

Within the JISC-funded ART project (University of Wales, Aberystwyth, http://www.aber.ac.uk/compsci/Research/bio/art/) we developed a tool, SAPIENT, for annotating scientific papers with core scientific concepts (e.g. 'Goal', 'Hypothesis', 'Experiment', 'Method', 'Result', 'Conclusion', 'Motivation', 'Observation'). These concepts constitute the CISP meta-data and were verified through an online survey addressed to researchers. The CISP meta-data were accompanied by a set of guidelines for their implementation as an annotation scheme. We worked with chemistry experts, who used the guidelines and SAPIENT to create a corpus of 225 papers manually annotated with CISP concepts. The sustainability of annotating papers with CISP meta-data, and the benefits obtained from it, will be investigated by the JISC-funded SAPIENT Automation (SAPIENTA) project.

Source Data: The source data consist of text in XML format, encoded in Unicode (UTF-8). The XML schema used is a variant of SciXML, which can be provided upon request. The ART Corpus XML differs from SciXML in the following ways:
* An element has been added at the same level as the , and elements. The latter elements can occur within a element according to the SciXML schema. This tag covers all kinds of sentences. That is, there is no distinction between sentences in the abstract (denoted as in SciXML) and sentences in the main paper (denoted as in SciXML) or sentences within equations ( ) and examples ( ).
* The element has an id (sid) and can include an element.
* The element has the attributes "type", "conceptID", "novelty" and "advantage". For more details please refer to the annotation guidelines.

Annotation: The goal of the annotation was to mark up core scientific concepts in research papers. Papers from the domains of chemistry and biochemistry were chosen as a proof-of-principle approach.
Annotation was performed by 20 chemistry experts at PhD or postdoctoral level with excellent knowledge of English. The selected annotators were given an annotation package consisting of a set of guidelines[3] for annotating papers with CISP, the SAPIENT system[4] and its manual, and an example paper that had already been annotated. Most of this material is available for download from: http://www.aber.ac.uk/compsci/Research/bio/art/sapient. The annotation guidelines are available upon request.

Work with annotators was conducted in three phases over a period of six months. In phase I (training phase) all 20 annotators were sent the same four papers to annotate using SAPIENT and the annotation guidelines, in order to familiarise themselves with the process. Individual annotators' results were analysed in detail at this stage and were used to improve the guidelines. In phase II (evaluation phase) the aim was to evaluate both the annotators and the guidelines. A preliminary evaluation of the experts' agreement was conducted on a sample of 41 papers (5,000 sentences), which were annotated by 16 experts divided into non-overlapping groups of 3 experts. The results show significant agreement between annotators, given the difficulty of the task (an average kappa coefficient of 0.55 per group). The 9 experts from phase II with the highest average inter-annotator agreement were selected for phase III, which constitutes the actual creation of the ART Corpus through the annotation of 225 papers.

Distribution: The ART Corpus is available as a 2.2 MB tar.gz file which expands to 12 MB. It consists of 225 papers (> 1 million words, 35,040 sentences), provided as a collection of 225 .xml files, where each file corresponds to a separate paper whose sentences have been annotated individually with core scientific concepts. The papers have been arranged into 9 folders, one for each of the 9 annotators.
These papers can be processed individually, per folder or as a batch by any script for handling XML. Papers can be displayed individually with the SAPIENT software[4], which was used to create the original annotations. For instructions on how to use SAPIENT to display the papers, please refer to SAPIENT_FAQ.txt (both can be downloaded from: http://www.aber.ac.uk/compsci/Research/bio/art/sapient).

For any requests or details regarding the corpus, please contact Dr Maria Liakata (mal@aber.ac.uk).
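
As an illustration of the batch processing mentioned above, the following is a minimal Python sketch, not part of the corpus distribution. It walks an unpacked copy of the corpus and counts annotation "type" values per annotator folder. The function name tally_concepts is hypothetical, as is the assumption that each folder name identifies one annotator. Annotation elements are recognised only by the "type" and "conceptID" attributes listed under Source Data, so no element name is assumed.

```python
from collections import Counter, defaultdict
from pathlib import Path
import xml.etree.ElementTree as ET


def tally_concepts(corpus_root):
    """Count annotation 'type' values per annotator folder.

    Assumptions (not confirmed by the corpus description):
    - the unpacked corpus has one sub-folder per annotator,
      each holding the .xml files for that annotator;
    - any element carrying both 'type' and 'conceptID'
      attributes is a concept annotation.
    """
    counts = defaultdict(Counter)
    for xml_path in sorted(Path(corpus_root).rglob("*.xml")):
        folder = xml_path.parent.name  # assumed to identify the annotator
        root = ET.parse(xml_path).getroot()  # raises ParseError on malformed XML
        for element in root.iter():
            attributes = element.attrib
            if "type" in attributes and "conceptID" in attributes:
                counts[folder][attributes["type"]] += 1
    # Convert to plain dicts so the result is easy to print or serialise.
    return {folder: dict(counter) for folder, counter in counts.items()}
```

Calling tally_concepts on the directory into which the tar.gz file was unpacked returns a dictionary keyed by folder name, each value mapping a concept type to the number of annotations of that type. Restricting the path to a single folder or a single file gives the per-folder and per-paper variants.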

Maria Liakata | Larisa N. Soldatova