The Frankfurt Latin Lexicon: From Morphological Expansion and Word Embeddings to SemioGraphs

In this article we present the Frankfurt Latin Lexicon (FLL), a lexical resource for Medieval Latin that is used both for lemmatizing Latin texts and for post-editing the resulting lemmatizations. We describe recent advances in the development of lemmatizers and test them against the Capitularies corpus (Frankish royal edicts from the mid-6th to the mid-9th century), which was created as a reference for processing Medieval Latin. We also consider the post-correction of lemmatizations through a limited crowdsourcing process aimed at the continuous review and updating of the FLL. Starting from the texts that result from this lemmatization process, we describe how the FLL is extended by means of word embeddings, whose interactive traversal with SemioGraphs completes the digitally enhanced hermeneutic circle. In this way, the article argues for a more comprehensive understanding of lemmatization, one that encompasses classical machine learning as well as intellectual post-correction and, in particular, human computation in the form of interpretation processes based on graph representations of the underlying lexical resources.
