Paper information - FoLiA in Practice. The Infrastructure of a Linguistic Annotation Format

We present an overview of the software and data infrastructure for FoLiA, a Format for Linguistic Annotation developed within the scope of the CLARIN-NL project and other projects. FoLiA aims to provide a single unified file format accommodating a wide variety of linguistic annotation types, preventing the proliferation of different formats for different annotation types. FoLiA is being developed in a bottom-up and practice-driven fashion. We have invested mainly in the creation of a rich infrastructure of tools that enable developers and end-users to work with the format. This paper presents the current state of this infrastructure.
