What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation

In the last two years, there has been a surge of word embedding algorithms and research on them. However, evaluation has mostly been carried out on a narrow set of tasks, mainly word similarity/relatedness and word relation similarity, and on a single language, namely English. We propose an approach for evaluating embeddings on a variety of languages that also yields insights into the structure of the embedding space by investigating how well word embeddings cluster along different syntactic features. We show that all embedding approaches behave similarly in this task, with dependency-based embeddings performing best. This effect is even more pronounced when generating low-dimensional embeddings.
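As a rough illustration of what "clustering along a syntactic feature" could mean in practice, the sketch below scores a set of embeddings by how well a nearest-centroid rule recovers a feature label (here, a coarse part-of-speech tag) from the vectors alone. This is an assumption made for illustration, not necessarily the procedure used in the paper: the function names, the two-dimensional toy vectors, and the NOUN/VERB labels are all hypothetical.

```python
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_accuracy(train, test):
    """Fraction of test vectors whose nearest class centroid matches their label.

    `train` and `test` are lists of (vector, label) pairs. A high score
    suggests that words sharing a feature value lie close together.
    """
    by_label = defaultdict(list)
    for vector, label in train:
        by_label[label].append(vector)
    centroids = {label: centroid(vs) for label, vs in by_label.items()}

    correct = 0
    for vector, label in test:
        predicted = min(
            centroids, key=lambda l: squared_distance(vector, centroids[l])
        )
        correct += predicted == label
    return correct / len(test)


# Hypothetical toy embeddings, labeled with a coarse part-of-speech tag.
train = [
    ([0.9, 0.1], "NOUN"),
    ([0.8, 0.2], "NOUN"),
    ([0.1, 0.9], "VERB"),
    ([0.2, 0.8], "VERB"),
]
test = [([0.7, 0.3], "NOUN"), ([0.3, 0.7], "VERB")]

accuracy = nearest_centroid_accuracy(train, test)
print(accuracy)
```

The same score can be computed per language and per feature (for example tense, case, or number), which would make embeddings of different types and dimensionalities comparable on a common scale.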
