Ontology-Based Interface Specifications for a NLP Pipeline Architecture

The high degree of heterogeneity among linguistic annotations usually complicates the interoperability of processing modules within an NLP pipeline. This paper presents a framework for the interoperation of NLP components, based on a data-driven architecture. Ontologies of linguistic annotation provide a conceptual basis for the tagset-neutral processing of linguistic annotations. The framework is based on a set of structured OWL ontologies: a reference ontology, a set of annotation models that formalize different annotation schemes, and a declarative linking between these, specified separately. This modular architecture is scalable and flexible: it allows different reference ontologies of linguistic annotations to be integrated, which compensates for the lack of consensus on an ontology of linguistic terminology. Our proposal draws on three lines of research from different fields: research on annotation type systems in UIMA; the ontological architecture OLiA, originally developed for sustainable documentation and annotation-independent corpus browsing; and the ontologies of the OntoTag model, targeted at the processing of linguistic annotations in Semantic Web applications. We describe how UIMA annotations can be backed by ontological specifications of annotation schemes, as in the OLiA model, and how these are linked to the OntoTag ontologies, which allow for further ontological processing.
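
The three-part structure described above (reference ontology, annotation models, and a separately specified linking) can be illustrated with a minimal sketch. It is not the framework's implementation: plain Python dictionaries stand in for the OWL ontologies, and all concept names, the "toy" tagset, and the function names are hypothetical. Only the Penn Treebank tags NN, NNP, and VB are real tags. The point is that a component can query a tag by reference concept without knowing which tagset produced it.

```python
# Reference ontology (hypothetical): concept -> parent concept.
REFERENCE_ONTOLOGY = {
    "Noun": "MorphosyntacticCategory",
    "CommonNoun": "Noun",
    "ProperNoun": "Noun",
    "Verb": "MorphosyntacticCategory",
}

# Annotation models: one per annotation scheme, mapping each tag to a
# scheme-specific concept. The "toy" tagset is invented for illustration.
ANNOTATION_MODELS = {
    "penn": {
        "NN": "penn:CommonNounSingular",
        "NNP": "penn:ProperNounSingular",
        "VB": "penn:VerbBaseForm",
    },
    "toy": {
        "N.common": "toy:CommonNoun",
        "N.proper": "toy:ProperNoun",
        "V": "toy:Verb",
    },
}

# Declarative linking, kept separate from both sides so that a different
# reference ontology could be plugged in by swapping only this table.
LINKING = {
    "penn:CommonNounSingular": "CommonNoun",
    "penn:ProperNounSingular": "ProperNoun",
    "penn:VerbBaseForm": "Verb",
    "toy:CommonNoun": "CommonNoun",
    "toy:ProperNoun": "ProperNoun",
    "toy:Verb": "Verb",
}


def reference_concepts(tagset: str, tag: str) -> list[str]:
    """Return the linked reference concept and all of its ancestors."""
    model_concept = ANNOTATION_MODELS[tagset][tag]
    concept = LINKING[model_concept]
    chain = [concept]
    # Walk up the reference hierarchy until a root concept is reached.
    while concept in REFERENCE_ONTOLOGY:
        concept = REFERENCE_ONTOLOGY[concept]
        chain.append(concept)
    return chain


def is_a(tagset: str, tag: str, reference_concept: str) -> bool:
    """Tagset-neutral test: does this tag denote the given reference concept?"""
    return reference_concept in reference_concepts(tagset, tag)
```

A downstream component could then ask `is_a("penn", "NN", "Noun")` and `is_a("toy", "N.proper", "Noun")` with the same call, regardless of which tagger produced the annotation. In the actual framework this kind of subsumption check would presumably be delegated to an OWL reasoner instead of a hand-written hierarchy walk.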
