Facilitating the Analysis of Discourse Phenomena in an Interoperable NLP Platform

The analysis of discourse phenomena is essential in many natural language processing (NLP) applications. The growing diversity of available corpora and NLP tools brings a multitude of representation formats. In order to alleviate the problem of incompatible formats when constructing complex text mining pipelines, the Unstructured Information Management Architecture (UIMA) provides a standard means of communication between tools and resources. U-Compare, a text mining workflow construction platform based on UIMA, further enhances interoperability through a shared system of data types, allowing free combination of compliant components into workflows. Although U-Compare and its type system already support syntactic and semantic analyses, support for the analysis of discourse phenomena was previously lacking. In response, we have extended the U-Compare type system with new discourse-level types. We illustrate processing and visualisation of discourse information in U-Compare by providing several new deserialisation components for corpora containing discourse annotations. The new U-Compare is downloadable from http://nactem.ac.uk/ucompare.
