Paper information

Recent Initiatives towards New Standards for Language Resources

This poster gives an overview of three ongoing initiatives towards language resource (LR) standards, initiated and coordinated by the German mirror group of ISO TC 37/SC 4 within DIN (Deutsches Institut für Normung):

• ISOTiger, an XML serialization of proposals for the syntactic annotation of text corpora;
• “Transcription of spoken language”, a set of guidelines for transcribing spoken utterances;
• “Corpus Query Lingua Franca”, a metastandard for comparing the formal properties of corpus query languages.

Coordinated by German experts, these upcoming international standards are all part of initiatives to standardize data formats and procedures for language resources internationally. The poster is intended not only to report on the ongoing work but also to open a discussion with further experts, so that the standards reflect the interests of the community.

Standards for LRs in the framework of ISO TC 37 cover several types of resources (text corpora, lexicons, terminology collections). Actors in computational linguistics and language technology cooperate and therefore need to exchange data and technologies using comparable methods and formats, cf. Eckart and Heid (2014). Most of the proposed standards are guidelines on a meta-level: they describe properties of representation formats instead of prescribing a format. Examples of these are the Lexical Markup Framework (LMF, ISO 24613:2008),

[1] Wolfgang Lezius, et al. TIGER: Linguistic Interpretation of a German Corpus, 2004.

[2] Ulrich Heid, et al. Resource interoperability revisited, 2014, KONVENS.

[4] Thomas Schmidt. A TEI-based Approach to Standardising Spoken Language Transcription, 2011.