Rhetorical Move Detection in English Abstracts: Multi-label Sentence Classifiers and their Annotated Corpora

The relevance of automatically identifying rhetorical moves in scientific texts has been widely acknowledged in the literature. This study focuses on abstracts of standard research papers written in English and aims to tackle a fundamental limitation of current machine-learning classifiers: they are mono-labeled, that is, a sentence can be assigned only a single label. However, such an approach does not adequately reflect actual language use, since a move can be realized by a clause, a sentence, or even several sentences. Here, we present MAZEA (Multi-label Argumentative Zoning for English Abstracts), a multi-label classifier that automatically identifies rhetorical moves in abstracts and allows a given sentence to be assigned as many labels as appropriate. We drew on various other NLP tools and used two large training corpora: (i) one consisting of 645 abstracts from physical sciences and engineering (PE) and (ii) another consisting of 690 abstracts from life and health sciences (LH). This paper presents our preliminary results, discusses the various challenges involved in multi-label tagging, and works towards satisfactory solutions. We also make our two training corpora publicly available so that they may serve as a benchmark for this new task.
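
To make the contrast between mono-label and multi-label tagging concrete, the following is a minimal sketch of one common multi-label strategy, binary relevance, in which each move receives its own independent yes/no decision. It is not MAZEA's implementation: the move names, the cue words, and the functions `predict_moves` and `to_indicator` are all hypothetical and chosen only for illustration.

```python
# Illustrative only: the abstract does not list MAZEA's label set or features.
# MOVES and CUES are hypothetical assumptions made for this sketch.
MOVES = ["background", "purpose", "method", "result", "conclusion"]

CUES = {
    "background": {"widely", "acknowledged", "known"},
    "purpose": {"aims", "aim", "focuses", "present"},
    "method": {"used", "corpus", "trained"},
    "result": {"results", "show", "achieved"},
    "conclusion": {"conclude", "suggest"},
}


def tokenize(sentence):
    """Lowercase the sentence and strip basic punctuation from each token."""
    return {tok.strip(".,;:()").lower() for tok in sentence.split()}


def predict_moves(sentence):
    """Binary relevance: decide each move independently, so a sentence
    may receive zero, one, or several labels."""
    tokens = tokenize(sentence)
    return [move for move in MOVES if tokens & CUES[move]]


def to_indicator(labels):
    """Encode a label list as a 0/1 vector aligned with MOVES."""
    return [1 if move in labels else 0 for move in MOVES]


sentence = "We used two corpora and present preliminary results."
labels = predict_moves(sentence)
print(labels)
print(to_indicator(labels))
```

A mono-label classifier would have to pick exactly one of these moves for the example sentence, whereas the multi-label encoding keeps every move the sentence realizes.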
