PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding

We look into the task of \emph{generalizing} word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words, \emph{without} extra contextual information. Relying solely on the spellings of words, we propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embeddings. We call the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords over all possible segmentations, weighted by their likelihoods. Inspections and an affix prediction experiment show that PBoS is able to produce meaningful subword segmentations and subword rankings without any source of explicit morphological knowledge. Word similarity and POS tagging experiments show clear advantages of PBoS over previous subword-level models in the quality of the generated word embeddings across languages.
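The core idea — composing a word vector as a likelihood-weighted average of bag-of-subwords vectors over every possible segmentation — can be illustrated with a minimal sketch. The subword probabilities and embeddings below are toy values chosen for illustration (the actual PBoS model learns them, and uses an efficient dynamic-programming algorithm rather than brute-force enumeration):

```python
import numpy as np

# Hypothetical subword probabilities and embeddings; in PBoS these are
# learned jointly. Here they are fixed toy values for illustration.
subword_prob = {"un": 0.3, "like": 0.4, "ly": 0.3, "u": 0.05, "n": 0.05,
                "unlike": 0.1, "likely": 0.2}
DIM = 4
rng = np.random.default_rng(0)
subword_vec = {s: rng.normal(size=DIM) for s in subword_prob}

def segmentations(word):
    """Enumerate every way to split `word` into known subwords."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in subword_prob:
            for rest in segmentations(word[i:]):
                yield [prefix] + rest

def pbos_embedding(word):
    """Likelihood-weighted bag-of-subwords over all segmentations."""
    segs = list(segmentations(word))
    # Score each segmentation by the product of its subword probabilities
    # (a unigram segmentation model), then normalize into a distribution.
    scores = np.array([np.prod([subword_prob[s] for s in seg]) for seg in segs])
    weights = scores / scores.sum()
    # Each segmentation contributes its bag-of-subwords (mean) vector,
    # weighted by that segmentation's likelihood.
    vec = np.zeros(DIM)
    for w, seg in zip(weights, segs):
        vec += w * np.mean([subword_vec[s] for s in seg], axis=0)
    return vec

print(pbos_embedding("unlikely"))
```

With this toy vocabulary, "unlikely" has five segmentations (e.g. `un+like+ly`, `un+likely`, `unlike+ly`), and higher-probability segmentations dominate the resulting vector — which is how the model can surface meaningful segmentations without explicit morphological supervision.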
