An Ontology for CoNLL-RDF: Formal Data Structures for TSV Formats in Language Technology

In language technology and the language sciences, tab-separated values (TSV) are a frequently used formalism for representing linguistically annotated natural language; such formats are often referred to as "CoNLL formats". A large number of these formats exist, and although they share a number of common features, they are not interoperable, because these dialects encode the same pieces of information differently. CoNLL-RDF refers to a programming library and its associated data model, introduced to facilitate processing and transforming such TSV formats in a serialization-independent way. CoNLL-RDF represents CoNLL data by means of RDF graphs and manipulates it with SPARQL update operations, but so far it has done so without machine-readable semantics: annotation properties are created dynamically on the basis of a user-defined mapping from columns to labels. Current applications of CoNLL-RDF include linking corpora with dictionaries [Mambrini and Passarotti, 2019] and knowledge graphs [Tamper et al., 2018], syntactic parsing of historical languages [Chiarcos et al., 2018; Chiarcos et al., 2018], the consolidation of syntactic and semantic annotations [Chiarcos and Fath, 2019], a bridge between RDF corpora and a traditional corpus query language [Ionov et al., 2020], and language contact studies [Chiarcos et al., 2018]. We describe a novel extension of CoNLL-RDF that introduces a formal data model, formalized as an ontology. The ontology provides a basis for linking RDF corpora with other Semantic Web resources, but more importantly, its application to the transformation between different TSV formats is a major step toward interoperability between CoNLL formats.
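The column-to-label mapping mentioned above can be illustrated with a minimal sketch. It is not the CoNLL-RDF library itself: the function name `tsv_to_triples`, the base URI, the word-identifier scheme, and the use of the `conll:` and `nif:` namespace URIs shown here are assumptions made for illustration. Each token row becomes one RDF subject, and each non-empty column becomes a property named after the user-supplied label.

```python
# Namespace URIs are assumptions for this sketch, not taken from the abstract.
CONLL = "http://ufal.mff.cuni.cz/conll2009-st/task-description.html#"
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def tsv_to_triples(tsv, columns, base="http://example.org/corpus#"):
    """Turn CoNLL-style TSV text into (subject, predicate, object) triples.

    `columns` is the user-defined mapping from column position to label;
    the property for each column is created on the fly from that label.
    """
    triples = []
    sent_no, tok_no, prev = 1, 0, None
    for line in tsv.splitlines():
        if line.startswith("#"):      # skip comment lines
            continue
        if not line.strip():          # an empty line ends the current sentence
            if tok_no:
                sent_no, tok_no, prev = sent_no + 1, 0, None
            continue
        tok_no += 1
        word = f"{base}s{sent_no}_{tok_no}"   # hypothetical identifier scheme
        triples.append((word, RDF_TYPE, NIF + "Word"))
        if prev is not None:          # preserve token order within the sentence
            triples.append((prev, NIF + "nextWord", word))
        for label, value in zip(columns, line.split("\t")):
            if value != "_":          # "_" conventionally marks an empty cell
                triples.append((word, CONLL + label, value))
        prev = word
    return triples


if __name__ == "__main__":
    sample = "1\tThe\tDET\n2\tdog\tNOUN\n\n1\tBarks\t_\n"
    for triple in tsv_to_triples(sample, ["ID", "WORD", "POS"]):
        print(triple)
```

Because the property names come directly from the labels, two corpora that call the same column `POS` and `UPOS` yield unrelated properties; this is the kind of gap that a formal data model for the columns is meant to close.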

[1] Marco Passarotti, et al. Linked Open Treebanks. Interlinking Syntactically Annotated Corpora in the LiLa Knowledge Base of Linguistic Resources for Latin, 2019, Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019).

[2] Antske Fokkens, et al. NAF and GAF: Linking Linguistic Annotations, 2014.

[3] Adam Kilgarriff, et al. The Sketch Engine, 2004.

[4] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[5] Jens Lehmann, et al. Integrating NLP Using Linked Data, 2013, SEMWEB.

[6] Christian Chiarcos, et al. POWLA: Modeling Linguistic Corpora in OWL/DL, 2012, ESWC.

[7] Christian Chiarcos, et al. A Tree Extension for CoNLL-RDF, 2020, LREC.

[8] Christian Chiarcos, et al. CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way, 2017, LDK.

[9] Karin M. Verspoor, et al. Towards Adaptation of Linguistic Annotations to Scholarly Annotation Formalisms on the Semantic Web, 2012, LAW@ACL.

[10] Eero Hyvönen, et al. Semantic National Biography of Finland, 2018, DHN.

[11] James Pustejovsky, et al. The LAPPS Interchange Format, 2015, WLSI.

[12] Christian Chiarcos, et al. Towards LLOD-based Language Contact Studies: A Case Study in Interoperability, 2018.

[13] Christian Chiarcos, et al. Annotating a Low-Resource Language with LLOD Technology: Sumerian Morphology and Syntax, 2018, Inf.

[14] Christian Chiarcos, et al. Automatic Detection of Language and Annotation Model Information in CoNLL Corpora, 2019, LDK.

[15] Sampo Pyysalo, et al. Universal Dependencies v1: A Multilingual Treebank Collection, 2016, LREC.

[16] Steven Pemberton, et al. Web Annotation Data Model, 2017.

[17] Christian Chiarcos, et al. Fintan - Flexible, Integrated Transformation and Annotation eNgineering, 2020, LREC.

[18] Sabine Buchholz, et al. Introduction to the CoNLL-2000 Shared Task Chunking, 2000, CoNLL/LLL.

[19] Ana-Maria Ghiran, et al. Semantic integration of security knowledge sources, 2018, 2018 12th International Conference on Research Challenges in Information Science (RCIS).

[20] Yunyao Li, et al. Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling, 2015, ACL.

[21] Christian Chiarcos, et al. Graph-Based Annotation Engineering: Towards a Gold Corpus for Role and Reference Grammar, 2019, LDK.

[22] Tim Berners-Lee, et al. Linked Data - The Story So Far, 2009, Int. J. Semantic Web Inf. Syst.

[23] Helmut Schmid, et al. Probabilistic part-of-speech tagging using decision trees, 1994.

[24] Christian Chiarcos, et al. Analyzing Middle High German Syntax with RDF and SPARQL, 2018, LREC.

[25] Eero Hyvönen, et al. Using Biographical Texts as Linked Data for Prosopographical Research and Applications, 2018, EuroMed.

[26] Christian Chiarcos, et al. Linguistic Linked Data: Representation, Generation and Applications, 2020.

[27] Ryan Cotterell, et al. UniMorph 2.0: Universal Morphology, 2018, LREC.

[28] Nancy Ide, et al. International Standard for a Linguistic Annotation Framework, 2003, Natural Language Engineering.

[29] Oliver Christ, et al. A Modular and Flexible Architecture for an Integrated Corpus Query System, 1994, ArXiv.

[30] Stefan Evert, et al. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium, 2011.

[31] Christian Chiarcos, et al. cqp4rdf: Towards a Suite for RDF-Based Corpus Linguistics, 2020, ESWC.

[32] Adil El Ghali, et al. TELIX: An RDF-Based Model for Linguistic Annotation, 2012, ESWC.

[33] Christian Chiarcos, et al. A generic formalism to represent linguistic corpora in RDF and OWL/DL, 2012, LREC.