Interlinking through Lemmas. The Lexical Collection of the LiLa Knowledge Base of Linguistic Resources for Latin

This paper presents the structure of the LiLa Knowledge Base, a collection of diverse linguistic resources for Latin that are described with a common vocabulary for knowledge description and interlinked according to the principles of the Linked Data paradigm. Because LiLa is highly lexically based, its core consists of a large collection of Latin lemmas, which serves as the backbone for interoperability between the resources: entries in lexical resources and tokens in corpora are linked whenever they point to the same lemma. After detailing the architecture supporting LiLa, the paper focuses on how we approach the challenges raised by harmonizing the different lemmatization strategies found in linguistic resources for Latin. As an example of the process of connecting a linguistic resource to LiLa, the inclusion of a dependency treebank in the Knowledge Base is described and evaluated.
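
The lemma-as-hub idea described above can be illustrated with a minimal sketch. It assumes a toy set of subject-predicate-object triples; every identifier in it (the `lemma:`, `treebank:`, `dictionary:` and `valency:` prefixes, the predicate names, and the function `resources_by_lemma`) is hypothetical and chosen only for illustration, not taken from the actual LiLa vocabulary or data.

```python
from collections import defaultdict

# Toy collection of lemmas (hypothetical identifiers, not real LiLa URIs).
LEMMA_BANK = {
    "lemma:amo": {"written_rep": "amo", "pos": "VERB"},
    "lemma:rosa": {"written_rep": "rosa", "pos": "NOUN"},
}

# (subject, predicate, object) triples. Corpus tokens and lexical entries
# never point at each other directly; both point at a shared lemma.
# The predicate names are illustrative assumptions.
TRIPLES = [
    ("treebank:s1_t2", "hasLemma", "lemma:amo"),
    ("treebank:s4_t1", "hasLemma", "lemma:rosa"),
    ("dictionary:entry_amo", "canonicalForm", "lemma:amo"),
    ("valency:frame_amo_1", "canonicalForm", "lemma:amo"),
]


def resources_by_lemma(triples):
    """Group every linked resource item under the lemma it points to."""
    index = defaultdict(list)
    for subject, _predicate, lemma in triples:
        # Only keep links whose target is a known lemma.
        if lemma in LEMMA_BANK:
            index[lemma].append(subject)
    return dict(index)


index = resources_by_lemma(TRIPLES)
# All items, from any resource, that share the lemma "amo".
print(index["lemma:amo"])
```

Because each resource links only to the lemma collection, adding a new resource in this sketch means adding triples toward existing lemmas rather than pairwise mappings to every other resource.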
