Pattern Masking for Dictionary Matching

In the Pattern Masking for Dictionary Matching (PMDM) problem, we are given a dictionary $\mathcal{D}$ of $d$ strings, each of length $\ell$, a query string $q$ of length $\ell$, and a positive integer $z$, and we are asked to compute a smallest set $K\subseteq\{1,\ldots,\ell\}$ such that, when $q[i]$ is replaced by a wildcard for every $i\in K$, $q$ matches at least $z$ strings from $\mathcal{D}$. The PMDM problem lies at the heart of two important applications featured in large-scale real-world systems: record linkage of databases that contain sensitive information, and query term dropping. In both applications, solving PMDM provides data utility guarantees, unlike existing approaches. We first show, through a reduction from the well-known $k$-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We present a data structure for PMDM that answers queries over $\mathcal{D}$ in time $\mathcal{O}(2^{\ell/2}(2^{\ell/2}+\tau)\ell)$ and requires space $\mathcal{O}(2^{\ell}d^2/\tau^2+2^{\ell/2}d)$, for any parameter $\tau\in[1,d]$. We also approach the problem from a more practical perspective. We show an $\mathcal{O}((d\ell)^{k/3}+d\ell)$-time and $\mathcal{O}(d\ell)$-space algorithm for PMDM if $k=|K|=\mathcal{O}(1)$. We generalize our exact algorithm to mask multiple query strings simultaneously. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtac et al., SODA 2017]. This gives a polynomial-time $\mathcal{O}(d^{1/4+\epsilon})$-approximation algorithm for PMDM, which is tight under plausible complexity conjectures.
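To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the data structure or any of the algorithms described above: it simply tries candidate sets $K$ in order of increasing size and counts dictionary matches, so its running time is exponential in $\ell$. The names `matches` and `pmdm_bruteforce` are hypothetical, chosen only for this illustration.

```python
from itertools import combinations


def matches(q, s, masked):
    """Return True if q matches s once the 0-based positions in `masked` are wildcards."""
    return all(i in masked or q[i] == s[i] for i in range(len(q)))


def pmdm_bruteforce(dictionary, q, z):
    """Return a smallest set K (1-based positions, as in the problem statement)
    such that masking q at K makes it match at least z dictionary strings,
    or None if even the fully masked query matches fewer than z strings.

    Illustrative exhaustive search only; assumes every string has length len(q).
    """
    ell = len(q)
    for k in range(ell + 1):  # try smaller masks first
        for positions in combinations(range(ell), k):
            masked = set(positions)
            count = sum(matches(q, s, masked) for s in dictionary)
            if count >= z:
                return [i + 1 for i in positions]
    return None


if __name__ == "__main__":
    D = ["abc", "abd", "xbc", "xyz"]
    print(pmdm_bruteforce(D, "abc", 2))  # masking position 1 matches "abc" and "xbc"
    print(pmdm_bruteforce(D, "abc", 3))  # positions 1 and 3 must both be masked
```

When several smallest sets exist, this sketch returns the lexicographically first one; the problem statement asks only for some smallest $K$.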
