Local Density Estimation in High Dimensions

An important question that arises in the study of high-dimensional vector representations learned from data is the following: given a set $\mathcal{D}$ of vectors and a query $q$, estimate the number of points in $\mathcal{D}$ that lie within a specified distance threshold of $q$. We develop two estimators, LSH Count and Multi-Probe Count, that use locality-sensitive hashing to preprocess the data and then answer such queries accurately and efficiently via importance sampling. A key innovation is the ability to maintain a small number of hash tables, enabled by preprocessing data structures and algorithms that sample from multiple buckets in each hash table. We give bounds on the space requirements and sample complexity of our schemes, and demonstrate their effectiveness in experiments on a standard word embedding dataset.
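To make the setup concrete, the sketch below shows one minimal way an LSH-based importance-sampling count estimator can be assembled, i.e., estimating $|\{x \in \mathcal{D} : \|x - q\| \le r\}|$ with a single SimHash (random-hyperplane) table. The class name `LSHCountSketch`, the parameters `k` and `num_samples`, and the Horvitz-Thompson inverse-collision-probability weighting are illustrative assumptions for this sketch, not the paper's LSH Count or Multi-Probe Count algorithms, which keep the number of tables small by sampling from multiple buckets per table.

```python
import numpy as np

class LSHCountSketch:
    """Minimal single-table sketch, assuming a SimHash family and
    Euclidean distances on roughly unit-norm vectors."""

    def __init__(self, data, k=12, seed=0):
        rng = np.random.default_rng(seed)
        self.data = np.asarray(data, dtype=float)    # (n, d) dataset
        self.k = k
        self.planes = rng.standard_normal((k, self.data.shape[1]))
        codes = (self.data @ self.planes.T) > 0      # (n, k) sign patterns
        self.buckets = {}
        for i, code in enumerate(codes):
            self.buckets.setdefault(code.tobytes(), []).append(i)

    def _collision_prob(self, x, q):
        # SimHash: Pr[all k hyperplane signs agree] = (1 - theta/pi)^k,
        # where theta is the angle between x and q.
        cos = np.clip(x @ q / (np.linalg.norm(x) * np.linalg.norm(q)), -1.0, 1.0)
        theta = np.arccos(cos)
        return (1.0 - theta / np.pi) ** self.k

    def estimate_count(self, q, radius, num_samples=64, seed=1):
        """Horvitz-Thompson estimate of |{x in D : ||x - q|| <= radius}|."""
        rng = np.random.default_rng(seed)
        q = np.asarray(q, dtype=float)
        key = ((q @ self.planes.T) > 0).tobytes()
        bucket = self.buckets.get(key, [])
        if not bucket:
            return 0.0
        samples = rng.choice(bucket, size=num_samples, replace=True)
        total = 0.0
        for i in samples:
            x = self.data[i]
            if np.linalg.norm(x - q) <= radius:
                # Re-weight by the inverse probability that x fell into q's
                # bucket, which makes the estimate unbiased over the hash
                # randomness (though possibly high-variance with one table).
                total += 1.0 / self._collision_prob(x, q)
        return (len(bucket) / num_samples) * total
```

Here the inverse collision probability $(1 - \theta/\pi)^{-k}$ corrects for the fact that points far from $q$ are under-represented in $q$'s bucket. A single table of this form can have high variance; the schemes described in the abstract reduce it while keeping few tables by drawing weighted samples from multiple probed buckets within each table.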
