FacetE: exploiting web tables for domain-specific word embedding evaluation

Today's natural language processing and information retrieval systems heavily depend on word embedding techniques to represent text values. However, given a specific task, choosing a suitable word embedding model is not trivial. Current word embedding evaluation methods mostly provide a one-dimensional quality measure that does not express how knowledge from different domains is represented in the word embedding models. To overcome this limitation, we provide a new evaluation dataset called FacetE, derived from 125M Web tables, which enables domain-sensitive evaluation. We show that FacetE can be used effectively to evaluate word embedding models. The evaluation of common general-purpose word embedding models suggests that there is currently no single best word embedding for every domain.

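To illustrate what a domain-sensitive (faceted) evaluation looks like operationally, the following is a minimal sketch, not the FacetE protocol itself: it assumes a hypothetical set of word pairs grouped by domain facet and toy word vectors, and reports a separate cosine-similarity score per facet instead of a single global number.

```python
import numpy as np

# Toy word vectors; in practice these would come from a trained model
# such as word2vec or fastText.
embeddings = {
    "paris":   np.array([0.9, 0.1, 0.0]),
    "france":  np.array([0.8, 0.2, 0.1]),
    "berlin":  np.array([0.1, 0.9, 0.0]),
    "germany": np.array([0.2, 0.8, 0.1]),
    "aspirin": np.array([0.0, 0.1, 0.9]),
    "drug":    np.array([0.1, 0.0, 0.8]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical facet-grouped test items: word pairs that should be close
# in vector space, grouped by the domain (facet) they come from.
facet_items = {
    "geography": [("paris", "france"), ("berlin", "germany")],
    "medicine":  [("aspirin", "drug")],
}

# Report one score per facet rather than a single aggregate measure,
# so strengths and weaknesses per domain remain visible.
for facet, pairs in facet_items.items():
    scores = [cosine(embeddings[a], embeddings[b])
              for a, b in pairs
              if a in embeddings and b in embeddings]
    print(f"{facet}: mean cosine similarity = {np.mean(scores):.3f}")
```

A model that scores well on one facet (e.g., geography) may score poorly on another (e.g., medicine), which a single aggregate number would hide; this is the kind of per-domain breakdown the faceted evaluation is meant to expose.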