Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain

Word embeddings are well studied in the general domain: they are usually trained on large text corpora and have been evaluated, for example, on word similarity and analogy tasks, and as input to downstream NLP processes. In contrast, this work explores the suitability of word embedding techniques in the specialized digital humanities domain. After training embedding models of various types on two popular fantasy novel series, we evaluate their performance on two task types: term analogies and word intrusion. To this end, we manually construct test datasets with domain experts. Our contributions include an evaluation of various word embedding techniques on both task types, with the finding that even embeddings trained on small corpora perform well on, for example, the word intrusion task. Furthermore, we provide extensive, high-quality digital humanities datasets for further investigation, as well as an implementation that makes it easy to reproduce or extend the experiments.
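To make the two task types concrete, the following is a minimal sketch of how term analogies and word intrusion are commonly scored against an embedding model. It is not the authors' implementation: the vector-offset analogy method, the mean-cosine intruder heuristic, the function names, and the toy vectors in `TOY` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector offset b - a + c.

    The three query words are excluded from the candidates, as is usual
    for this kind of evaluation.
    """
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def find_intruder(emb, words):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(emb[word], emb[o]) for o in others) / len(others)

    return min(words, key=mean_similarity)


# Hypothetical 3-dimensional toy vectors, not taken from any trained model.
TOY = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "sword": [-1.0, 0.2, 0.2],
}

print(solve_analogy(TOY, "man", "king", "woman"))
print(find_intruder(TOY, ["king", "queen", "man", "woman", "sword"]))
```

In an actual evaluation, `emb` would hold vectors from a model trained on the novels, and accuracy would be the share of expert-built test items for which the predicted answer or intruder matches the gold label.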
