String Sanitization Under Edit Distance

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in O(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
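To make the problem statement concrete, the following is a minimal Python sketch of the two ingredients the abstract names: the classic quadratic-time dynamic program for edit distance, and a checker for conditions (i) and (ii). It is not the paper's O(kn²) algorithm, which modifies this dynamic program in ways the abstract does not describe. The function names, the choice of Σ as the set of letters occurring in W, and the use of "#" as a letter outside Σ are assumptions made only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic program (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def kmers_over_sigma(s: str, k: int, sigma: set) -> list:
    """Length-k substrings of s, in order, that use only letters of sigma."""
    return [s[i:i + k] for i in range(len(s) - k + 1)
            if all(c in sigma for c in s[i:i + k])]


def satisfies_constraints(w: str, x: str, k: int, sensitive: set) -> bool:
    """Check conditions (i) and (ii) for a candidate x.

    Assumption: sigma is taken to be the set of letters occurring in w,
    so any other letter in x (for example '#') lies outside the alphabet.
    """
    sigma = set(w)
    keep_w = [m for m in kmers_over_sigma(w, k, sigma) if m not in sensitive]
    kmers_x = kmers_over_sigma(x, k, sigma)
    no_sensitive = all(m not in sensitive for m in kmers_x)   # condition (i)
    same_order = kmers_x == keep_w                            # condition (ii)
    return no_sensitive and same_order


if __name__ == "__main__":
    w, k, sensitive = "abcab", 2, {"ca"}
    x = "abc#ab"                       # hypothetical candidate output
    print(satisfies_constraints(w, x, k, sensitive))
    print(edit_distance(w, x))
```

In this toy instance the candidate "abc#ab" breaks the single occurrence of the sensitive substring "ca" with one inserted letter while leaving the sequence ab, bc, ab of non-sensitive length-2 substrings unchanged. Condition (iii) asks for a candidate of this kind whose edit distance to W is as small as possible; the sketch only evaluates a given candidate and does not search for an optimal one.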

[1]  Bradley Malin,et al.  Determining the identifiability of DNA database entries , 2000, AMIA.

[2]  Russell Impagliazzo,et al.  Which problems have strongly exponential complexity? , 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science.

[3]  Beng Chin Ooi,et al.  Efficiently Supporting Edit Distance Based String Similarity Search Using B+-Trees , 2014, IEEE Trans. Knowl. Data Eng..

[4]  Solon P. Pissis,et al.  Reverse-Safe Data Structures for Text Indexing , 2020, ALENEX.

[5]  Chen Li,et al.  SEPIA: estimating selectivities of approximate string predicates in large Databases , 2008, The VLDB Journal.

[6]  Hongxia Jin,et al.  An Information-Theoretic Approach to Individual Sequential Data Sanitization , 2016, WSDM.

[7]  Francesco Bonchi,et al.  Hiding Sequential and Spatiotemporal Patterns , 2010, IEEE Transactions on Knowledge and Data Engineering.

[8]  Russell Impagliazzo,et al.  Complexity of k-SAT , 1999, Proceedings. Fourteenth Annual IEEE Conference on Computational Complexity (Formerly: Structure in Complexity Theory Conference).

[9]  Robert Gwadera,et al.  Permutation-Based Sequential Pattern Hiding , 2013, 2013 IEEE 13th International Conference on Data Mining.

[10]  A. Schuchat DEPARTMENT OF HEALTH & HUMAN SERVICES , 2015 .

[11]  Robert Gwadera,et al.  Optimal event sequence sanitization , 2015, SDM.

[12]  Lu Li,et al.  Efficient secure similarity computation on encrypted trajectory data , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[13]  Heng Xu,et al.  Information Privacy Research: An Interdisciplinary Review , 2011, MIS Q..

[14]  Jiawei Han,et al.  MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance , 2016, SDM.

[15]  Nikos Mamoulis,et al.  Local Suppression and Splitting Techniques for Privacy Preserving Publication of Trajectories , 2017, IEEE Transactions on Knowledge and Data Engineering.

[16]  E. Myers,et al.  Approximate matching of regular expressions. , 1989, Bulletin of mathematical biology.

[17]  Marvin Künnemann,et al.  Quadratic Conditional Lower Bounds for String Problems and Dynamic Time Warping , 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[18]  Piotr Indyk,et al.  Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false) , 2014, STOC.

[19]  Spiros Skiadopoulos,et al.  Apriori-based algorithms for km-anonymizing trajectory data , 2014, Trans. Data Priv..

[20]  Roberto Grossi,et al.  String Sanitization: A Combinatorial Approach , 2019, ECML/PKDD.

[21]  Zeyi Wen,et al.  2ED: An Efficient Entity Extraction Algorithm Using Two-Level Edit-Distance , 2019, 2019 IEEE 35th International Conference on Data Engineering (ICDE).

[22]  Agustí Verde Parera,et al.  General data protection regulation , 2018 .

[23]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .