Learning to Compute Word Embeddings On the Fly

Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare. Learning representations for words in the "long tail" of this distribution requires enormous amounts of data. Representations of rare words trained directly on end tasks are usually poor, which forces us either to pre-train embeddings on external data or to treat all rare words as out-of-vocabulary tokens sharing a single representation. We provide a method for predicting embeddings of rare words on the fly from small amounts of auxiliary data, using a network trained end-to-end for the downstream task. We show that this improves results over baselines in which embeddings are trained only on the end task, for reading comprehension, recognizing textual entailment, and language modeling.
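
As a rough illustration of the idea (a minimal sketch, not the authors' exact architecture), the snippet below computes an embedding for a rare word on the fly by encoding auxiliary data about it, here a dictionary definition, with an LSTM; the definition encoder receives gradients from the downstream loss and is therefore trained end-to-end with the task model. The class name `DefinitionEmbedder`, the use of PyTorch, and the choice of an LSTM encoder are assumptions made for illustration.

```python
# Minimal sketch (PyTorch assumed): compute a rare word's embedding on the fly
# from auxiliary data (a dictionary definition), trained end-to-end with the task.
import torch
import torch.nn as nn


class DefinitionEmbedder(nn.Module):
    """Embeds a word from a learned lookup table (frequent words) or, for rare
    words, by encoding the words of its definition with an LSTM."""

    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb_dim)           # in-vocabulary words
        self.def_encoder = nn.LSTM(emb_dim, emb_dim, batch_first=True)

    def forward(self, word_id, definition_ids=None):
        if definition_ids is None:
            # Frequent word: use the embedding trained directly on the end task.
            return self.lookup(word_id)
        # Rare word: embed the definition's words and encode them, taking the
        # final LSTM hidden state as the word's on-the-fly embedding.
        def_embs = self.lookup(definition_ids)                     # (1, def_len, emb_dim)
        _, (h_n, _) = self.def_encoder(def_embs)
        return h_n.squeeze(0)                                      # (1, emb_dim)


# Usage: gradients flow through def_encoder from the downstream loss, so the
# definition reader is learned jointly with the rest of the task model.
model = DefinitionEmbedder(vocab_size=10000, emb_dim=64)
frequent = model(torch.tensor([3]))                                # lookup path
rare = model(torch.tensor([0]), torch.tensor([[5, 17, 42, 8]]))    # definition path
```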
