A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

Argumentative Zoning (AZ), the analysis of the argumentative structure of a scientific paper, has proved useful for a number of information access tasks. Current approaches to AZ rely on supervised machine learning (ML). Because they require large amounts of annotated data, these approaches are expensive to develop and to port to different domains and tasks. A potential solution to this problem is to use weakly-supervised ML instead. We investigate the performance of four weakly-supervised classifiers on scientific abstract data annotated for multiple AZ classes. Our best classifier, based on the combination of active learning and self-training, outperforms our best supervised classifier, yielding a high accuracy of 81% when using just 10% of the labeled data. This result suggests that weakly-supervised learning could be employed to improve the practical applicability and portability of AZ across different information access tasks.
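
The abstract does not spell out how active learning and self-training are combined, so the following is only a minimal sketch of one common way to do it, not the authors' method. It assumes a toy nearest-centroid classifier as a stand-in for whatever model is actually used, synthetic 2-D points with hypothetical labels "A" and "B" in place of AZ-annotated sentences, and an `oracle` function that simulates a human annotator. In each round the least confident pool items are sent to the oracle (active learning) and the most confident ones are added with their predicted labels (self-training).

```python
import math
import random


def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_with_margin(model, x):
    """Return (label, margin). The margin is the distance gap between the
    nearest and second-nearest centroid; a larger gap means more confidence."""
    dists = sorted((math.dist(x, c), label) for label, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def active_self_training(X_seed, y_seed, pool, oracle,
                         rounds=5, n_query=2, n_self=5):
    """Grow the labeled set from a small seed using both strategies.
    All parameter names and defaults here are illustrative assumptions."""
    X_lab, y_lab, pool = list(X_seed), list(y_seed), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = fit_centroids(X_lab, y_lab)
        scored = [(predict_with_margin(model, x), x) for x in pool]
        scored.sort(key=lambda item: item[0][1])  # least confident first
        queried, rest = scored[:n_query], scored[n_query:]
        k = min(n_self, len(rest))
        confident = rest[len(rest) - k:]
        remaining = rest[:len(rest) - k]
        for _, x in queried:
            # Active learning: ask the (simulated) annotator for the true label.
            X_lab.append(x)
            y_lab.append(oracle(x))
        for (label, _), x in confident:
            # Self-training: trust the classifier's own confident prediction.
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for _, x in remaining]
    return fit_centroids(X_lab, y_lab), len(X_lab)


def demo(seed=0):
    """Run the loop on two synthetic Gaussian clusters; return (accuracy, n_labeled)."""
    rng = random.Random(seed)

    def sample(label, n):
        center = 0.0 if label == "A" else 5.0
        return [([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label)
                for _ in range(n)]

    seed_set = sample("A", 1) + sample("B", 1)
    pool_set = sample("A", 50) + sample("B", 50)
    test_set = sample("A", 20) + sample("B", 20)
    truth = {tuple(x): label for x, label in pool_set}
    model, n_labeled = active_self_training(
        [x for x, _ in seed_set], [y for _, y in seed_set],
        [x for x, _ in pool_set], oracle=lambda x: truth[tuple(x)])
    correct = sum(predict_with_margin(model, x)[0] == y for x, y in test_set)
    return correct / len(test_set), n_labeled
```

Calling `demo()` returns the held-out accuracy and the final size of the labeled set. Only `n_query` items per round cost annotation effort, which is the sense in which such a loop can get by with a small fraction of the labeled data.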
