Choosing an XML database for linguistically annotated corpora

XML has become the de facto standard for representing linguistically annotated corpora. It may therefore seem safe to assume that storing and querying an XML-encoded annotated corpus in an XML database is straightforward. In practice, it is not. This article aims to provide guidelines for deciding whether to use an XML database and, if so, how to choose a suitable product. To this end, we examine the following questions: Which aspects should be considered before choosing to store an XML-encoded annotated corpus in an XML database? Which facilities does a database need to provide in order to be suitable for storing and querying annotated corpora? Do current XML databases offer these facilities, and, if not, can they be added?
