DLSH: a distribution-aware LSH scheme for approximate nearest neighbor query in cloud computing

Cloud computing needs to process and analyze massive high-dimensional data in real time. Approximate queries in cloud computing systems can return timely results with acceptable accuracy while consuming far fewer resources than exact queries. Locality Sensitive Hashing (LSH) preserves data locality and supports approximate queries. However, because it chooses hash functions at random, LSH needs a large number of functions to guarantee query accuracy, and the resulting computation and storage overheads degrade its practical performance. To reduce these overheads and deliver high performance, we propose a distribution-aware scheme, called DLSH, that offers a cost-effective approximate nearest neighbor query service for cloud computing. The idea of DLSH is to use the principal components of the data distribution as the projection vectors of the hash functions in LSH, to quantify the weight of each hash function, and to adjust the interval value in each hash table. We then refine the queried result set based on hit frequency, which significantly decreases the time overhead of distance computation. Extensive experiments in a large-scale cloud computing testbed demonstrate significant improvements in multiple system performance metrics. We have released the source code of DLSH for public use.
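The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the released DLSH code: the class and function names are hypothetical, and several details are assumptions made only for illustration, namely computing principal components by power iteration, weighting each hash function by its share of explained variance, setting each table's interval to one standard deviation along its component, and using one hash function per table.

```python
import math
import random
from collections import Counter, defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def principal_components(data, k, iters=100, seed=0):
    """Top-k principal directions via power iteration with deflation."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    comps, variances = [], []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # w = C v with C = X^T X / n, without forming C explicitly
            proj = [dot(row, v) for row in centered]
            w = [sum(p * row[j] for p, row in zip(proj, centered)) / n
                 for j in range(d)]
            # deflation: remove directions that were already found
            for c in comps:
                s = dot(w, c)
                w = [wj - s * cj for wj, cj in zip(w, c)]
            norm = math.sqrt(dot(w, w)) or 1.0
            v = [wj / norm for wj in w]
        comps.append(v)
        variances.append(sum(dot(row, v) ** 2 for row in centered) / n)
    return mean, comps, variances

class DistributionAwareIndex:
    """Hypothetical sketch: one hash table per principal component."""

    def __init__(self, data, num_tables):
        self.data = data
        self.mean, self.comps, var = principal_components(data, num_tables)
        total = sum(var) or 1.0
        # assumption: weight = share of variance explained by the component
        self.weights = [x / total for x in var]
        # assumption: interval = one standard deviation along the component
        self.widths = [max(math.sqrt(x), 1e-9) for x in var]
        self.tables = [defaultdict(list) for _ in self.comps]
        for idx, point in enumerate(data):
            for t, key in enumerate(self._keys(point)):
                self.tables[t][key].append(idx)

    def _keys(self, point):
        x = [a - m for a, m in zip(point, self.mean)]
        return [math.floor(dot(x, c) / w)
                for c, w in zip(self.comps, self.widths)]

    def query(self, q, k=1, max_candidates=20):
        # accumulate weighted hit frequency over all tables
        hits = Counter()
        for t, key in enumerate(self._keys(q)):
            for idx in self.tables[t].get(key, ()):
                hits[idx] += self.weights[t]
        # compute exact distances only for the most frequently hit candidates
        shortlist = [idx for idx, _ in hits.most_common(max_candidates)]
        shortlist.sort(key=lambda i: math.dist(q, self.data[i]))
        return shortlist[:k]

# Synthetic data stretched along the first axis (illustrative only).
rng = random.Random(42)
data = [(10 * rng.gauss(0, 1), rng.gauss(0, 1), 0.1 * rng.gauss(0, 1))
        for _ in range(200)]
index = DistributionAwareIndex(data, num_tables=3)
neighbors = index.query(data[7], k=3)
```

The design point the sketch tries to capture is that candidates are ranked by weighted hit frequency before any distance is computed, so exact distance computation is limited to a short list rather than to every point that collides with the query in some table.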
