Bit-Vector Search Filtering with Application to a Kanji Dictionary

Database query problems can be categorized by the expressiveness of their query languages, and data structure bounds are better for less expressive languages. Highly expressive languages, such as those permitting Boolean operations, lead to difficult query problems with poor bounds; high dimensionality in geometric problems likewise makes their query languages expressive and inefficient. The IDSgrep kanji dictionary software approaches a highly expressive tree-matching query problem with a filtering technique set in 128-bit Hamming space. This approach can serve as a model for other highly expressive query languages. We suggest generally applicable improvements to bit-vector filtering and evaluate them in the context of IDSgrep.
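
To make the general idea of bit-vector filtering concrete, the following is a minimal sketch of a generic Bloom-filter-style prefilter. It is not IDSgrep's actual scheme: the feature sets, the names `signature`, `may_match`, `build_index`, and `search`, and the choice of two hashed bits per feature are all assumptions made for illustration. Each entry is summarized by a 128-bit vector, a cheap bitwise test discards entries that cannot match, and the expensive exact test runs only on the survivors.

```python
import hashlib

BITS = 128  # vector width, chosen to match the 128-bit space in the abstract


def signature(features, k=2):
    """OR together k hashed bit positions per feature (Bloom-filter style).

    `features` is a hypothetical set of strings describing an entry or query.
    """
    sig = 0
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        for i in range(k):
            pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % BITS
            sig |= 1 << pos
    return sig


def may_match(query_sig, entry_sig):
    """Cheap filter: every bit set in the query must be set in the entry.

    False means the entry certainly does not match; True means it might.
    """
    return (query_sig & ~entry_sig) == 0


def build_index(entries):
    """Precompute a signature for each (name, feature_set) pair."""
    return [(name, set(feats), signature(feats)) for name, feats in entries]


def search(query_features, index):
    """Return names of entries whose features include all query features."""
    query = set(query_features)
    query_sig = signature(query)
    hits = []
    for name, feats, sig in index:
        if not may_match(query_sig, sig):
            continue  # rejected by the filter; exact test skipped
        if query <= feats:  # exact (expensive, in a real system) test
            hits.append(name)
    return hits
```

Because the filter only ever errs on the side of passing an entry, it produces false positives but no false negatives, so the final results are the same as running the exact test on every entry.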
