Using Typed Dependencies to Study and Recognise Conceptualisation Zones in Biomedical Literature

In the biomedical domain, authors publish their experiments and findings using a quasi-standard coarse-grained discourse structure, which starts with an introduction that sets up the motivation, continues with a description of the materials and methods, and concludes with results and discussions. Over the course of the years, there has been a fair amount of research done in the area of scientific discourse analysis, with a focus on performing automatic recognition of scientific artefacts/conceptualisation zones from the raw content of scientific publications. Most of the existing approaches use Machine Learning techniques to perform classification based on features that rely on the shallow structure of the sentence tokens, or sentences as a whole, in addition to corpus-driven statistics. In this article, we investigate the role carried by the deep (dependency) structure of the sentences in describing their rhetorical nature. Using association rule mining techniques, we study the presence of dependency structure patterns in the context of a given rhetorical type, the use of these patterns in exploring differences in structure between the rhetorical types, and their ability to discriminate between the different rhetorical types. Our final goal is to provide a series of insights that can be used to complement existing classification approaches. Experimental results show that, in particular in the context of a fine-grained multi-class classification context, the association rules emerged from the dependency structure are not able to produce uniform classification results. However, they can be used to derive discriminative pair-wise classification mechanisms, in particular for some of the most ambiguous types.

[1]  Joakim Nivre,et al.  Dependency Parsing , 2009, Lang. Linguistics Compass.

[2]  Alan Ruttenberg,et al.  The SWAN biomedical discourse ontology , 2008, J. Biomed. Informatics.

[3]  Noah A. Smith,et al.  Dependency Parsing , 2009, Encyclopedia of Artificial Intelligence.

[4]  William C. Mann,et al.  RHETORICAL STRUCTURE THEORY: A THEORY OF TEXT ORGANIZATION , 1987 .

[5]  György Szarvas,et al.  Hedge Classification in Biomedical Texts with a Weakly Supervised Selection of Keywords , 2008, ACL.

[6]  Simone Teufel Towards Discipline-Independent Argumentative Zoning : Evidence from Chemistry and Computational Linguistics , 2009 .

[7]  Simon Buckingham Shum,et al.  Modelling discourse in contested domains: A semiotic and cognitive framework , 2006, Int. J. Hum. Comput. Stud..

[8]  Aaron N. Kaplan,et al.  Discovering Paradigm Shift Patterns in Biomedical Abstracts: Application to Neurodegenerative Diseases , 2005 .

[9]  Christopher D. Manning,et al.  Stanford typed dependencies manual , 2010 .

[10]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[11]  Siegfried Handschuh,et al.  SALT - Semantically Annotated LaTeX for scientific publications , 2007 .

[12]  Christopher D. Manning,et al.  The Stanford Typed Dependencies Representation , 2008, CF+CDPE@COLING.

[13]  Simone Teufel,et al.  Towards Domain-Independent Argumentative Zoning: Evidence from Chemistry and Computational Linguistics , 2009, EMNLP.

[14]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[15]  Anita de Waard,et al.  Verb Form Indicates Discourse Segment Type in Biological Research Papers: Experimental Evidence. , 2012 .

[16]  Hagit Shatkay,et al.  Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users , 2008, Bioinform..

[17]  Halil Kilicoglu,et al.  Recognizing speculative language in biomedical research articles: a linguistically motivated perspective , 2008, BMC Bioinformatics.

[18]  Hagit Shatkay,et al.  New directions in biomedical text annotation: definitions, guidelines and corpus construction , 2006, BMC Bioinformatics.

[19]  Maria Liakata,et al.  The ART Corpus , 2009 .

[20]  Ted Briscoe,et al.  Weakly Supervised Learning for Hedge Classification in Scientific Literature , 2007, ACL.

[21]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[22]  Anita de Waard,et al.  A pragmatic structure for research articles , 2007, ICPW '07.

[23]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[24]  Dietrich Rebholz-Schuhmann,et al.  Automatic recognition of conceptualization zones in scientific articles and two life science applications , 2012, Bioinform..

[25]  Christopher D. Manning,et al.  Generating Typed Dependency Parses from Phrase Structure Parses , 2006, LREC.

[26]  Nigel Collier,et al.  Zone analysis in biology articles as a basis for information extraction , 2006, Int. J. Medical Informatics.

[27]  F.A.P. Harmsze A modular structure for scientific articles in an electronic environment , 2000 .

[28]  Simone Teufel,et al.  Corpora for the Conceptualisation and Zoning of Scientific Papers , 2010, LREC.

[29]  Simone Teufel,et al.  Argumentative zoning information extraction from scientific text , 1999 .

[30]  Maria Liakata,et al.  An ontology methodology and CISP-the proposed Core Information about Scientific Papers , 2007 .

[31]  Claire Grover,et al.  In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC , 2006 .

[32]  John Conley,et al.  Technical Report 2 , 2014 .