ForestDSH: a universal hash design for discrete probability distributions

In this paper, we consider the problem of classifying high-dimensional queries into high-dimensional classes over discrete alphabets, where the probabilistic model relating the data to the classes is known. This problem has applications in various fields, including the database search problem in mass spectrometry. It is analogous to the nearest neighbor search problem, where the goal is to find the data point in a database that is most similar to a query point. The state-of-the-art method for solving an approximate version of the nearest neighbor search problem in high dimensions is locality sensitive hashing (LSH). LSH is based on designing hash functions that map near points to the same buckets with higher probability than random (far) points. To solve our high-dimensional classification problem, we introduce distribution sensitive hashes that map jointly generated pairs to the same bucket with higher probability than random pairs. We design these distribution sensitive hashes using a forest of decision trees and analytically derive the complexity of search. We further show, both in theory and in simulations, that the proposed hashes are faster than state-of-the-art approximate nearest neighbor search methods for a range of probability distributions. Finally, we apply our method to the spectral library search problem in mass spectrometry and show that it is an order of magnitude faster than the state-of-the-art methods.
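The core idea, decision-tree buckets that collide jointly generated pairs more often than independent pairs, can be illustrated with a toy sketch. The Python snippet below is a minimal illustration under stated assumptions, not the paper's ForestDSH construction: the binary alphabet, the single-coordinate joint model P_JOINT, the prefix-based buckets, and the min_ratio acceptance threshold are all hypothetical choices introduced for demonstration.

```python
import itertools
import random

# Illustrative sketch only (assumptions, not the paper's construction):
# a toy distribution-sensitive hash whose buckets are pairs of symbol
# prefixes that are much more likely under the joint model than under
# the product of marginals.

ALPHABET = (0, 1)

# Hypothetical per-coordinate joint model P(x_i, y_i): jointly generated
# pairs agree (0.8 total mass on the diagonal) more often than chance.
P_JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P_X = {a: sum(P_JOINT[(a, b)] for b in ALPHABET) for a in ALPHABET}
P_Y = {b: sum(P_JOINT[(a, b)] for a in ALPHABET) for b in ALPHABET}


def grow_buckets(depth, min_ratio=2.0):
    """Keep prefix pairs whose joint probability beats the product of
    marginals by at least min_ratio; each surviving pair is a bucket."""
    buckets = []
    for xs in itertools.product(ALPHABET, repeat=depth):
        for ys in itertools.product(ALPHABET, repeat=depth):
            p = q = 1.0
            for a, b in zip(xs, ys):
                p *= P_JOINT[(a, b)]          # joint probability of prefix pair
                q *= P_X[a] * P_Y[b]          # probability under independence
            if p / q >= min_ratio:
                buckets.append((xs, ys))
    return buckets


def collide(x, y, buckets, depth):
    """True if the query prefix of x and data prefix of y share a bucket."""
    return any(tuple(x[:depth]) == xs and tuple(y[:depth]) == ys
               for xs, ys in buckets)


def sample_joint(n):
    """Draw a length-n (x, y) pair coordinate-wise from the joint model."""
    draws = random.choices(list(P_JOINT), weights=list(P_JOINT.values()), k=n)
    return [a for a, _ in draws], [b for _, b in draws]


if __name__ == "__main__":
    random.seed(0)
    depth, n, trials = 4, 20, 2000
    buckets = grow_buckets(depth)
    joint_hits = indep_hits = 0
    for _ in range(trials):
        x, y = sample_joint(n)       # jointly generated pair
        x2, _ = sample_joint(n)      # x2 is independent of y (a "random" pair)
        joint_hits += collide(x, y, buckets, depth)
        indep_hits += collide(x2, y, buckets, depth)
    print("joint collision rate: ", joint_hits / trials)
    print("random collision rate:", indep_hits / trials)
```

With these toy parameters, the accepted buckets are exactly the matching prefixes, so jointly generated pairs collide at a rate near 0.8^4 = 0.41 while independent pairs collide near 0.5^4 = 0.06; this gap between collision probabilities is the distribution-sensitive analogue of the near/far gap that LSH exploits for metric distance.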
