Hierarchical Density Order Embeddings

By representing words with probability densities rather than point vectors, probabilistic word embeddings can capture rich and interpretable semantic information and uncertainty. The uncertainty information can be particularly meaningful in capturing entailment relationships -- whereby general words such as "entity" correspond to broad distributions that encompass more specific words such as "animal" or "instrument". We introduce density order embeddings, which learn hierarchical representations through encapsulation of probability densities. In particular, we propose simple yet effective loss functions and distance metrics, as well as graph-based schemes to select negative samples to better learn hierarchical density representations. Our approach provides state-of-the-art performance on the WordNet hypernym relationship prediction task and the challenging HyperLex lexical entailment dataset -- while retaining a rich and interpretable density representation.
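As a rough illustration of the encapsulation idea, the sketch below scores an ordered word pair by how far the specific word's density falls outside the general word's density. It is a minimal sketch, assuming diagonal Gaussian embeddings and a thresholded KL divergence as the asymmetric penalty; the dimensionality, the threshold gamma, and the toy "entity"/"animal" densities are illustrative choices, not values from the paper.

```python
import numpy as np

def kl_diag_gaussians(mu_f, var_f, mu_g, var_g):
    """KL(f || g) for diagonal Gaussians f = N(mu_f, diag(var_f)), g = N(mu_g, diag(var_g))."""
    return 0.5 * np.sum(
        var_f / var_g
        + (mu_g - mu_f) ** 2 / var_g
        - 1.0
        + np.log(var_g)
        - np.log(var_f)
    )

def order_violation_penalty(mu_specific, var_specific, mu_general, var_general, gamma=2.0):
    """Soft encapsulation score: near zero when the specific word's density is
    roughly contained in the general word's density, positive otherwise.
    gamma is an illustrative divergence threshold."""
    div = kl_diag_gaussians(mu_specific, var_specific, mu_general, var_general)
    return max(0.0, div - gamma)

# Toy example: "animal" (narrow density) should be encompassed by "entity" (broad density).
mu_entity, var_entity = np.zeros(5), np.full(5, 4.0)
mu_animal, var_animal = np.full(5, 0.5), np.full(5, 1.0)
print(order_violation_penalty(mu_animal, var_animal, mu_entity, var_entity))  # ~0: little violation
```

In a training loop one would typically minimize this penalty for true hypernym pairs and push it above a margin for negative pairs drawn from the graph; the snippet only shows the pairwise order-violation score.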
