FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study

In this paper we present FoLiA, a Format for Linguistic Annotation, and compare it with other annotation schemes, including the Linguistic Annotation Framework (LAF), the Text Encoding Initiative (TEI), and the Text Corpus Format (TCF). An additional focus is the interoperability between FoLiA and metadata standards such as the Component MetaData Infrastructure (CMDI), as well as data category registries such as ISOcat. The aim of the paper is to give a clear picture of FoLiA's capabilities and of how it relates to other formats, which should open discussion and help users choose a format. FoLiA is a practically oriented, XML-based annotation format for the representation of language resources that explicitly supports a wide variety of annotation types. It introduces a flexible and uniform paradigm and a representation that is independent of language and label set. It is designed to be highly expressive, generic, and formalised, while remaining as practical as possible in order to ease its adoption and implementation. The aspiration is to offer a generic format for the storage, exchange, and machine processing of linguistically annotated documents, so that users and software tools need not cope with a wide variety of formats, a situation that regularly causes conversion problems and a proliferation of ad hoc formats in the field. FoLiA emerged from such a practical need in the context of computational linguistics in the Netherlands and Flanders, and it has been successfully adopted by numerous projects within this community. FoLiA was developed in a bottom-up fashion, with special emphasis on software libraries and tools to handle it.
