Distributional Semantics for Neo-Latin

We address the problem of creating and evaluating high-quality Neo-Latin word embeddings for philosophical research, adapting the Nonce2Vec tool to learn embeddings from Neo-Latin sentences. This distributional semantic modeling tool learns incrementally from tiny data, using a larger background corpus for initialization. We conduct two evaluation tasks: definitional learning of Latin Wikipedia terms, and learning consistent embeddings from 18th-century Neo-Latin sentences pertaining to the concept of mathematical method. Our results show that consistent Neo-Latin word embeddings can be learned from this type of data. While these evaluation results are promising, they do not reveal to what extent the learned models match domain experts' knowledge of our Neo-Latin texts. We therefore propose an additional evaluation method, grounded in expert-annotated data, that would assess whether the learned representations are conceptually sound with respect to the domain of study.
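
To make the setup concrete, the sketch below illustrates the general idea of background-initialized, incremental learning from tiny data using gensim's Word2Vec (version 4.x API). It is not the actual Nonce2Vec implementation; the corpus variables, the example sentences, and the target lemma "methodus" are hypothetical placeholders.

    # Minimal sketch (gensim 4.x), not the actual Nonce2Vec code:
    # initialize from a larger background corpus, then update incrementally
    # on a handful of Neo-Latin sentences and inspect the neighbourhood
    # of a target lemma.
    from gensim.models import Word2Vec

    # Hypothetical background corpus: tokenized Latin sentences.
    background_sentences = [
        ["methodus", "est", "via", "inveniendi", "veritatem"],
        ["mathematica", "disciplina", "certa", "est"],
        # ... many more sentences in practice
    ]

    # 1. Train (or load) a background model on the larger corpus.
    background = Word2Vec(
        sentences=background_sentences,
        vector_size=100, window=5, min_count=1, sg=1, epochs=10,
    )

    # Hypothetical "tiny data": a few 18th-century sentences on method.
    tiny_sentences = [
        ["methodus", "mathematica", "ordinem", "demonstrationis", "sequitur"],
        ["methodus", "synthetica", "a", "principiis", "ad", "conclusiones", "procedit"],
    ]

    def incremental_update(base_model, new_sentences, epochs=20):
        # Add any new vocabulary, then continue training on the tiny data.
        base_model.build_vocab(new_sentences, update=True)
        base_model.train(new_sentences,
                         total_examples=len(new_sentences),
                         epochs=epochs)
        return base_model

    updated = incremental_update(background, tiny_sentences)

    # Rough sanity check: nearest neighbours of the target lemma.
    print(updated.wv.most_similar("methodus", topn=5))

The actual Nonce2Vec tool builds on this word2vec machinery but applies more aggressive, per-word parameter settings when learning a new term from minimal exposure; the sketch above only illustrates the background-plus-incremental-update pattern, and a consistency evaluation would additionally compare the resulting neighbourhoods across repeated runs.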
