On the Utility of Word Embeddings for Enriching OpenWordNet-PT

The maintenance of wordnets and lexical knowledge bases typically relies on time-consuming manual effort. To reduce this effort, we propose exploiting models of distributional semantics, namely word embeddings learned from corpora, to automatically identify relation instances that are missing from a wordnet. Analogy-solving methods are first used to learn a set of relations from analogy tests, each focused on a single relation. Although the accuracy of these methods is low, we noted that a portion of the top-ranked answers are good suggestions of relation instances that could be included in the wordnet. This procedure is applied to the enrichment of OpenWordNet-PT, a public Portuguese wordnet. Relations are learned from data acquired from this resource, and illustrative examples are provided. Results are promising for accelerating the identification of missing relation instances: we estimate that about 17% of the potential suggestions are good, a proportion that almost doubles when some suggestions are automatically invalidated.

2012 ACM Subject Classification: Computing methodologies → Lexical semantics; Computing methodologies → Language resources
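To make the idea of suggesting missing relation instances more concrete, the following is a minimal sketch of one possible analogy-solving strategy: average the vector offsets of known relation pairs, add that offset to a source word, and rank the remaining vocabulary by cosine similarity. This is only an illustration in the spirit of offset-averaging methods; the abstract does not say which analogy-solving methods were used. The toy three-dimensional vectors, the word pairs, and the function names (`cosine`, `mean_offset`, `suggest`) are all assumptions made for this example, not data or code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_offset(pairs, emb):
    """Average of (target - source) vectors over known relation pairs."""
    offsets = [[t - s for s, t in zip(emb[a], emb[b])] for a, b in pairs]
    return [sum(col) / len(offsets) for col in zip(*offsets)]


def suggest(source, pairs, emb, known, top_n=3):
    """Rank candidate targets for `source`, skipping instances already known."""
    offset = mean_offset(pairs, emb)
    target = [x + d for x, d in zip(emb[source], offset)]
    candidates = [
        (cosine(target, vec), word)
        for word, vec in emb.items()
        if word != source and (source, word) not in known
    ]
    return [word for _, word in sorted(candidates, reverse=True)[:top_n]]


# Hypothetical toy embeddings (real ones would be learned from corpora).
emb = {
    "cão": [1.0, 0.0, 0.0],
    "mamífero": [1.0, 0.0, 1.0],
    "rosa": [0.0, 1.0, 0.0],
    "flor": [0.0, 1.0, 1.0],
    "gato": [0.9, 0.1, 0.0],
    "felino": [0.9, 0.1, 1.0],
    "tulipa": [0.1, 0.9, 0.0],
}

# Hypothetical hyponym -> hypernym pairs standing in for wordnet data.
pairs = [("cão", "mamífero"), ("rosa", "flor")]

# Instances already in the wordnet are filtered out, so only
# potentially missing ones are suggested for manual validation.
suggestions = suggest("gato", pairs, emb, known=set())
print(suggestions)
```

In a setting like the one described above, the ranked list would be shown to a human editor rather than added to the wordnet automatically, since only a fraction of top-ranked answers are expected to be good.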
