Annotating a Low-Resource Language with LLOD Technology: Sumerian Morphology and Syntax

This paper describes work on the morphological and syntactic annotation of Sumerian cuneiform as a model for low-resource languages in general. Cuneiform texts are invaluable sources for the study of the history, languages, economy, and cultures of Ancient Mesopotamia and its surrounding regions. Assyriology, the discipline dedicated to their study, has vast research potential but lacks modern means for computational processing and analysis. Our project, Machine Translation and Automated Analysis of Cuneiform Languages, aims to fill this gap by bringing together corpus data, lexical data, linguistic annotations, and object metadata. The project’s main goal is to build a pipeline for machine translation and annotation of Sumerian Ur III administrative texts. The resulting rich, structured data are then to be made accessible as (Linguistic) Linked Open Data (LLOD), which should open them to a larger research community. Our contribution is twofold. In terms of language technology, our work represents the first attempt to develop an integrative infrastructure for the annotation of morphology and syntax on the basis of RDF technologies and LLOD resources. With respect to Assyriology, we work towards producing the first syntactically annotated corpus of Sumerian.
