An Architecture for Language Processing for Scientific Texts

We describe the architecture for language processing adopted in the eScience project ‘Extracting the Science from Scientific Publications’ (nicknamed SciBorg). In this approach, papers from different sources are first converted into a common XML format (SciXML). Language processing modules operate on the SciXML in an architecture that allows for (partially) parallel deep and shallow processing and for a flexible combination of domain-independent and domain-dependent techniques. Robust Minimal Recursion Semantics (RMRS) acts both as a language for representing the output of processing and as an integration language for combining different modules. Language processing produces RMRS markup, represented as standoff annotation on the original SciXML. Information extraction (IE) of various types is defined as operating on RMRSs. Rhetorical analysis of the texts also partially depends on IE-like patterns and supports novel methods of information access.
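
The idea of standoff annotation mentioned above can be illustrated with a minimal Python sketch. It is not the SciBorg implementation and does not reproduce the SciXML or RMRS formats: the `Standoff` class, its field names, the layer labels, and the example sentence are all hypothetical, chosen only to show that annotations from different modules can point into an unmodified source text by character offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Standoff:
    """One annotation stored apart from the text it describes (hypothetical schema)."""
    start: int   # inclusive character offset into the source text
    end: int     # exclusive character offset into the source text
    layer: str   # assumed label for the producing module, e.g. "deep" or "shallow"
    value: str   # assumed payload, e.g. a predicate or entity label


def covered_text(source: str, ann: Standoff) -> str:
    """Return the span of the untouched source that an annotation points to."""
    return source[ann.start:ann.end]


def annotations_in(anns, start, end):
    """Select annotations from any layer that lie fully inside [start, end)."""
    return [a for a in anns if a.start >= start and a.end <= end]


# Illustrative sentence and annotations; none of this comes from the paper.
source = "Aspirin inhibits cyclooxygenase."
annotations = [
    Standoff(0, 7, "shallow", "chemical"),
    Standoff(8, 16, "deep", "_inhibit_v"),
    Standoff(17, 31, "shallow", "chemical"),
]

for ann in annotations:
    print(ann.layer, ann.value, repr(covered_text(source, ann)))
```

Because the source string is never rewritten, output from several modules can be layered over the same span and queried together, which is the property the architecture relies on when combining deep and shallow processing.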
