Tree-wise Distribution Sensitive Hashing: Efficient Maximum Likelihood Classification by Joint Dimensionality Reduction in Known Probabilistic Settings

We consider the problem of maximum likelihood classification of a high-dimensional data point $y$ into one of billions of classes $x_1, \ldots, x_N$, where the conditional probability $p(y \mid x)$ is known. In the most general case, the complexity of the brute-force method for this classification grows linearly, as $O(N)$, with the number of classes $N$. Efficient multiclass classification methods have been introduced to solve this problem with logarithmic complexity. However, these methods suffer from the curse of dimensionality: in high dimensions, their complexity approaches $O(N)$ per query data point. In the special case where the conditional probability distribution $p(y \mid x)$ is a Gaussian centered at $x$, i.e., $p(y \mid x) \propto \mathcal{N}(x, \sigma)$, maximum likelihood classification reduces to nearest neighbor search under the Euclidean norm. Sublinear methods based on locality sensitive hashing (LSH) have been introduced to solve an approximate version of the nearest neighbor search problem for high-dimensional data. Inspired by these advances, here we introduce distribution sensitive hashing (DSH) to solve an approximate version of the maximum likelihood classification problem through joint dimensionality reduction. For discrete probability distributions, we design TreeDSH, a universal family of distribution sensitive hashes based on decision trees, and show that its complexity grows sub-linearly in $N$. Theory and simulations presented in this paper demonstrate that TreeDSH is more efficient than the LSH-Hamming and MinHash schemes. Finally, we apply TreeDSH to the problem of peptide identification from mass spectrometry data.
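The Gaussian special case admits a one-line derivation, reproduced here as a clarifying worked equation (not part of the original abstract): with $p(y \mid x_i) \propto \exp\!\left(-\|y - x_i\|_2^2 / (2\sigma^2)\right)$,

\[
\hat{x} \;=\; \arg\max_{x_i} \, p(y \mid x_i) \;=\; \arg\max_{x_i} \, \exp\!\left(-\frac{\|y - x_i\|_2^2}{2\sigma^2}\right) \;=\; \arg\min_{x_i} \, \|y - x_i\|_2,
\]

since the normalization is the same for every class and $\exp(-t/2\sigma^2)$ is decreasing in $t$; hence maximum likelihood classification coincides with Euclidean nearest neighbor search.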
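For readers who want to see the brute-force baseline and the hash-and-filter idea in running form, the sketch below is a minimal illustration under assumed toy settings (an independent bit-flip channel as the known $p(y \mid x)$). The helper names brute_force_ml, build_index, and hash_filter_ml are hypothetical, and the random-coordinate hash family stands in for a generic scheme; this is not the paper's TreeDSH construction.

import math
import random
from collections import defaultdict

def brute_force_ml(y, classes, log_likelihood):
    """O(N) baseline from the abstract: score every class against the query."""
    return max(classes, key=lambda x: log_likelihood(y, x))

def build_index(classes, hash_fns):
    """One bucket table per hash function (standard LSH-style amplification)."""
    tables = []
    for h in hash_fns:
        table = defaultdict(list)
        for x in classes:
            table[h(x)].append(x)
        tables.append(table)
    return tables

def hash_filter_ml(y, tables, hash_fns, log_likelihood):
    """Evaluate the likelihood only on classes that collide with y in some table."""
    candidates = set()
    for h, table in zip(hash_fns, tables):
        candidates.update(table.get(h(y), []))
    if not candidates:
        return None
    return max(candidates, key=lambda x: log_likelihood(y, x))

if __name__ == "__main__":
    random.seed(0)
    d, N, eps, L = 32, 1000, 0.1, 10
    classes = [tuple(random.randint(0, 1) for _ in range(d)) for _ in range(N)]
    truth = random.choice(classes)
    # y is a noisy copy of truth: each bit flips independently with probability eps.
    y = tuple(b ^ (random.random() < eps) for b in truth)

    def log_likelihood(y, x):
        # log p(y|x) for the independent bit-flip channel (the "known" model)
        return sum(math.log(1 - eps) if yi == xi else math.log(eps)
                   for yi, xi in zip(y, x))

    # Each hash projects onto a random subset of coordinates, so likely pairs
    # (small Hamming distance) collide far more often than unlikely ones.
    def make_hash():
        coords = random.sample(range(d), 8)
        return lambda x: tuple(x[i] for i in coords)
    hash_fns = [make_hash() for _ in range(L)]

    tables = build_index(classes, hash_fns)
    print("brute force correct:", brute_force_ml(y, classes, log_likelihood) == truth)
    print("hash filter correct:", hash_filter_ml(y, tables, hash_fns, log_likelihood) == truth)

A single table can miss the true class whenever the query's hash differs from its source's, which is why the sketch repeats with several independent hash functions; the efficiency of any such scheme is governed by the gap between the collision probabilities of likely and unlikely pairs, the kind of trade-off that distribution sensitive hashing targets.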
