Reverse-Safe Text Indexing

We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm that constructs a z-reverse-safe data structure (z-RSDS) that has size O(n) and answers decision and counting pattern matching queries of length at most d optimally, where d is maximal for any such z-RSDS. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We also show that plugging our method into data analysis applications results in insignificant or no loss of data utility. Furthermore, we show how our technique can be extended to support applications under realistic adversary models. Finally, we show a z-RSDS for decision pattern matching queries whose size can be sublinear in n. A preliminary version of this article appeared in ALENEX 2020.
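
To make the definition of z-reverse-safety concrete, the following is a minimal brute-force sketch in Python. It is not the construction algorithm described above, which is far more efficient; it simply enumerates every candidate text and is feasible only for very short inputs. It assumes that the "answers" are the occurrence counts of all patterns of length at most d, and that candidate texts are drawn from the letters of the input text. The function names (substring_counts, count_consistent_texts, max_safe_d) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import product


def substring_counts(text, d):
    """Occurrence counts of every substring of length 1..d in text."""
    counts = Counter()
    for length in range(1, d + 1):
        for i in range(len(text) - length + 1):
            counts[text[i:i + length]] += 1
    return counts


def count_consistent_texts(text, d, alphabet=None):
    """Number of texts (including text itself) that give the same counting
    answers for all patterns of length at most d. Exhaustive enumeration:
    exponential in len(text), for illustration only."""
    # Assumption: candidates use only letters occurring in the text.
    alphabet = sorted(set(text)) if alphabet is None else alphabet
    target = substring_counts(text, d)
    return sum(
        1
        for cand in product(alphabet, repeat=len(text))
        if substring_counts("".join(cand), d) == target
    )


def max_safe_d(text, z):
    """Largest d such that at least z texts are consistent with all counts
    for patterns of length at most d (0 if no such d exists)."""
    best = 0
    for d in range(1, len(text) + 1):
        # The number of consistent texts never grows with d, so we can stop
        # at the first d that falls below the threshold z.
        if count_consistent_texts(text, d) >= z:
            best = d
        else:
            break
    return best
```

For example, for the text "abaab" the counts of all patterns of length at most 2 are shared by exactly two texts ("abaab" and "aabab"), so max_safe_d("abaab", 2) returns 2, whereas adding the length-3 counts pins the text down uniquely.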
