FoLiA in Practice: The Infrastructure of a Linguistic Annotation Format

We present an overview of the software and data infrastructure for FoLiA, a Format for Linguistic Annotation developed within the CLARIN-NL project and other projects. FoLiA aims to provide a single unified file format that accommodates a wide variety of linguistic annotation types, preventing the proliferation of separate formats for different annotation types. FoLiA is being developed in a bottom-up, practice-driven fashion. We have invested mainly in a rich infrastructure of tools that enable developers and end users to work with the format. This paper presents the current state of that infrastructure.
