Pigeonring: A Principle for Faster Thresholded Similarity Search

The pigeonhole principle states that if $n$ items are placed in $m$ boxes, then at least one box contains no more than $n / m$ items. It is used to solve many data management problems, especially thresholded similarity searches. Although many pigeonhole principle-based solutions have been proposed over the last few decades, the condition stated by the principle is weak: it constrains the number of items in only a single box. By organizing the boxes in a ring, we propose a new principle, called the pigeonring principle, which constrains the number of items in multiple boxes and yields stronger conditions. To apply the new principle, we focus on problems that ask for the data objects whose similarities or distances to the query are constrained by a threshold. Many solutions to these problems use the pigeonhole principle to find candidates that satisfy a filtering condition. With the new principle, stronger filtering conditions can be established. We show that the pigeonhole principle is a special case of the new principle, which suggests that all pigeonhole principle-based solutions can potentially be accelerated by it. We introduce a universal filtering framework that encompasses the solutions to these problems based on the new principle. We also discuss how to quickly find the candidates specified by the new principle. The implementation requires only minor modifications to existing pigeonhole principle-based algorithms. Experimental results on real datasets demonstrate the applicability of the new principle and the superior performance of the algorithms based on it.
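
The difference between the two conditions can be illustrated with a small sketch. The abstract does not spell out the exact form of the ring condition, so the code below assumes one natural reading: the boxes are the per-part distances between a query and an object, and a chain of consecutive boxes on the ring is accepted only if every prefix of length $l$ sums to at most $l \cdot \tau / m$, where $\tau$ is the threshold. The function names `pigeonhole_ok` and `pigeonring_starts` and the parameter `length` are hypothetical and chosen for illustration only.

```python
def pigeonhole_ok(boxes, tau):
    """Weak condition: at least one single box holds no more than tau / m."""
    m = len(boxes)
    # Compare b * m <= tau instead of b <= tau / m to stay in integers.
    return any(b * m <= tau for b in boxes)


def pigeonring_starts(boxes, tau, length=None):
    """Assumed ring condition (illustrative, not taken from the abstract).

    Returns every start index i such that, for each l = 1..length, the l
    consecutive boxes starting at i (wrapping around the ring) sum to at
    most l * tau / m.
    """
    m = len(boxes)
    if length is None:
        length = m
    starts = []
    for i in range(m):
        total = 0
        ok = True
        for l in range(1, length + 1):
            total += boxes[(i + l - 1) % m]
            if total * m > l * tau:  # prefix of length l exceeds its quota
                ok = False
                break
        if ok:
            starts.append(i)
    return starts


# An object whose total distance (6) exceeds the threshold (4):
non_result = [0, 3, 0, 3]
print(pigeonhole_ok(non_result, 4))                # passes the weak filter
print(pigeonring_starts(non_result, 4, length=2))  # no chain of length 2 passes

# An object whose total distance (4) meets the threshold (4):
result = [2, 0, 1, 1]
print(pigeonring_starts(result, 4))                # at least one start survives
```

With `length=1` the ring check reduces to the single-box check, which matches the statement that the pigeonhole principle is a special case. Longer chains can reject objects that the single-box check lets through, while any object whose total is within the threshold always keeps at least one valid start.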
