String Sampling with Bidirectional String Anchors

The minimizers sampling mechanism is a popular string sampling mechanism introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers w and k, it selects the lexicographically smallest length-k substring in every fragment of w consecutive length-k substrings (that is, in every sliding window of length w + k − 1). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Although they do not have good worst-case guarantees on their size, they are often small in practice, and they have thus been successfully employed in several string processing applications. Minimizers sampling mechanisms have two main disadvantages: first, they do not have good guarantees on the expected size of their samples for every combination of w and k; and, second, indexes constructed over their samples do not have good worst-case guarantees for on-line pattern searches. To alleviate these disadvantages, we introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer ℓ, our mechanism selects the lexicographically smallest rotation in every length-ℓ fragment (that is, in every sliding window of length ℓ). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experiments using several datasets demonstrate that the bd-anchors sample sizes decrease proportionally to ℓ, and that these sizes are competitive with or smaller than the minimizers sample sizes using the analogous sampling parameters. We provide theoretical justification for these results by analyzing the expected size of bd-anchors samples. As a negative result, we show that computing a total order ≤ on the input alphabet that minimizes the bd-anchors sample size is NP-hard.
We also show that by using any bd-anchors sample, we can construct, in near-linear time, an index which requires linear (extra) space in the size of the sample and answers on-line pattern searches in near-optimal time. We further show, using several datasets, that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index [Grabowski and Raniszewski, Softw. Pract. Exp. 2017]. Finally, we highlight the applicability of bd-anchors by developing an efficient and effective heuristic for top-K similarity search under edit distance. We show, using synthetic datasets, that our heuristic is more accurate and more than one order of magnitude faster in top-K similarity searches than the state-of-the-art tool for the same purpose [Zhang and Zhang, KDD 2020].
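The two sampling rules defined above can be illustrated with a minimal sketch. It is a naive quadratic-time rendering of the definitions, not the linear-time algorithms the abstract refers to. The function names `minimizers` and `bd_anchors`, the 0-based positions, and the leftmost tie-breaking rule are assumptions made for this illustration.

```python
def minimizers(text, w, k):
    """Start positions (0-based) of the (w, k)-minimizers of text.

    Naive sketch: for every window of w consecutive length-k substrings,
    select the lexicographically smallest one (leftmost on ties; the
    tie-breaking rule is an assumption of this sketch).
    """
    selected = set()
    window_len = w + k - 1
    for start in range(len(text) - window_len + 1):
        best = min(range(start, start + w),
                   key=lambda i: (text[i:i + k], i))
        selected.add(best)
    return sorted(selected)


def bd_anchors(text, ell):
    """Start positions (0-based) of the order-ell bd-anchors of text.

    Naive sketch: for every length-ell window, select the starting
    position of its lexicographically smallest rotation (leftmost on
    ties; the tie-breaking rule is an assumption of this sketch).
    """
    selected = set()
    for start in range(len(text) - ell + 1):
        window = text[start:start + ell]
        best = min(range(ell),
                   key=lambda r: (window[r:] + window[:r], r))
        selected.add(start + best)
    return sorted(selected)


if __name__ == "__main__":
    t = "aabaaabcbda"
    print(minimizers(t, 3, 3))  # [0, 3, 4, 5, 6]
    print(bd_anchors(t, 5))     # [3, 4, 5, 10]
```

Both calls slide a window of length 5 over the same string (w + k − 1 = 5 and ℓ = 5), so the two outputs are directly comparable. On this input the bd-anchors sample has four positions and the minimizers sample has five.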

[1]  Solon P. Pissis,et al.  Bidirectional String Anchors: A New String Sampling Mechanism , 2021, ESA.

[2]  Roberto Grossi,et al.  Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching , 2005, SIAM J. Comput..

[3]  Heng Li,et al.  Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences , 2015, Bioinform..

[4]  Moshe Lewenstein,et al.  Suffix Trays and Suffix Trists: Structures for Faster Text Indexing , 2006, Algorithmica.

[5]  Derrick E. Wood,et al.  Kraken: ultrafast metagenomic sequence classification using exact alignments , 2014, Genome Biology.

[6]  Richard Cole,et al.  Dictionary matching and indexing with errors and don't cares , 2004, STOC '04.

[7]  Chirag Jain,et al.  Weighted minimizer sampling improves long read mapping , 2020, bioRxiv.

[8]  Peter Sanders,et al.  Linear work suffix array construction , 2006, JACM.

[9]  Guoliang Li,et al.  Can we beat the prefix filtering?: an adaptive framework for similarity join and search , 2012, SIGMOD Conference.

[10]  Tomasz Kociumaka,et al.  Practical Performance of Space Efficient Data Structures for Longest Common Extensions , 2020, ESA.

[11]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[12]  Qin Zhang,et al.  MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance , 2020, KDD.

[13]  Solon P. Pissis,et al.  Indexing Weighted Sequences: Neat and Efficient , 2020, Inf. Comput..

[14]  Tomasz Kociumaka.  Minimal Suffix and Rotation of a Substring in Optimal Time , 2016, CPM.

[15]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[16]  Djamal Belazzougui,et al.  Linear time construction of compressed text indices in compact space , 2014, STOC.

[17]  Guillaume Marçais,et al.  Improved design and analysis of practical minimizers , 2020, bioRxiv.

[18]  Daniel Shawcross Wilkerson,et al.  Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.

[19]  Anthony K. H. Tung,et al.  Efficient and Effective KNN Sequence Search with Approximate n-grams , 2013, Proc. VLDB Endow..

[20]  Sebastian Deorowicz,et al.  KMC 2: Fast and resource-frugal k-mer counting , 2014, Bioinform..

[21]  Carl Kingsford,et al.  Asymptotically optimal minimizers schemes , 2018, bioRxiv.

[22]  Ron Shamir,et al.  Compact Universal k-mer Hitting Sets , 2016, WABI.

[23]  Bin Wang,et al.  VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams , 2007, VLDB.

[24]  Costas S. Iliopoulos,et al.  Property Suffix Array with Applications in Indexing Weighted Sequences , 2020, ACM J. Exp. Algorithmics.

[25]  Vissarion Fisikopoulos.  An implementation of range trees with fractional cascading in C++ , 2011, ArXiv.

[26]  Haixun Wang,et al.  Asymmetric signature schemes for efficient exact edit similarity query processing , 2013, TODS.

[27]  Heng Li,et al.  Minimap2: pairwise alignment for nucleotide sequences , 2017, Bioinform..

[28]  Michael Roberts,et al.  Reducing storage requirements for biological sequence comparison , 2004, Bioinform..

[29]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[30]  Gad M. Landau,et al.  Text Indexing and Dictionary Matching with One Error , 2000, J. Algorithms.

[31]  Richard M. Karp,et al.  Reducibility Among Combinatorial Problems , 1972, 50 Years of Integer Programming.

[32]  Martin Farach-Colton,et al.  Optimal Suffix Tree Construction with Large Alphabets , 1997, FOCS.

[33]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[34]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[35]  Carl Kingsford,et al.  Practical universal k-mer sets for minimizer schemes , 2019, bioRxiv.

[36]  János Komlós,et al.  Storing a sparse table with O(1) worst case access time , 1982, 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982).

[37]  Paul Medvedev,et al.  Compacting de Bruijn graphs from sequencing data quickly and in low memory , 2016, Bioinform..

[38]  Huei-Jan Shyr,et al.  Disjunctive Languages and Codes , 1977, FCT.

[39]  Maxime Crochemore,et al.  Algorithms on strings , 2007 .

[41]  Bonnie Berger,et al.  A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets , 2020, bioRxiv.

[42]  Guy E. Blelloch,et al.  Parallel Range, Segment and Rectangle Queries with Augmented Maps , 2018, ALENEX.

[43]  Timothy M. Chan,et al.  Orthogonal range searching on the RAM, revisited , 2011, SoCG '11.

[44]  Kellogg S. Booth,et al.  Lexicographically Least Circular Substrings , 1980, Inf. Process. Lett..

[45]  Ambuj K. Singh,et al.  Efficient Index Structures for String Databases , 2001, VLDB.

[46]  Guillaume Marçais,et al.  Lower density selection schemes via small universal hitting sets with short remaining path length , 2020, RECOMB.

[47]  Srinivas Aluru,et al.  A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases , 2017, bioRxiv.

[48]  Wing-Kai Hon,et al.  Breaking a Time-and-Space Barrier in Constructing Full-Text Indices , 2009, SIAM J. Comput..

[49]  Chirag Jain,et al.  A fast adaptive algorithm for computing whole-genome homology maps , 2018, bioRxiv.

[50]  Ron Shamir,et al.  Improving the performance of minimizers and winnowing schemes , 2017, bioRxiv.

[51]  Irit Dinur,et al.  The importance of being biased , 2002, STOC '02.

[52]  Gonzalo Navarro,et al.  Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time , 2016, SODA.

[53]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[54]  Srinivas Aluru,et al.  A Fast Adaptive Algorithm for Computing Whole-Genome Homology Maps , 2018 .

[55]  Wen-Syan Li,et al.  Top-k string similarity search with edit-distance constraints , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[56]  Gonzalo Navarro,et al.  Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space , 2018, J. ACM.

[57]  G. Marçais,et al.  Sequence-specific minimizers via polar sets , 2021, bioRxiv.

[58]  Mark de Berg,et al.  Computational geometry: algorithms and applications, 3rd Edition , 1997 .

[59]  Richard M. Karp,et al.  Efficient Randomized Pattern-Matching Algorithms , 1987, IBM J. Res. Dev..

[60]  C. Schensted Longest Increasing and Decreasing Subsequences , 1961, Canadian Journal of Mathematics.

[61]  Guoliang Li,et al.  A pivotal prefix based filtering algorithm for string similarity search , 2014, SIGMOD Conference.

[62]  Guoliang Li,et al.  Top-k Spatio-Textual Similarity Join , 2016 .

[63]  Szymon Grabowski,et al.  Sampled suffix array with minimizers , 2017, Softw. Pract. Exp..

[64]  Alistair Moffat,et al.  From Theory to Practice: Plug and Play with Succinct Data Structures , 2013, SEA.

[65]  Tomasz Kociumaka,et al.  String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure , 2019, STOC.

[66]  Gonzalo Navarro,et al.  Position-Restricted Substring Searching , 2006, LATIN.

[67]  Jin Wang,et al.  A unified framework for string similarity search with edit-distance constraint , 2016, The VLDB Journal.

[68]  Zhenglu Yang,et al.  Fast Algorithms for Top-k Approximate String Matching , 2010, AAAI.

[69]  Meng He,et al.  Indexing Compressed Text , 2003 .

[70]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[71]  Yakov Nekrich,et al.  Fast Preprocessing for Optimal Orthogonal Range Reporting and Range Successor with Applications to Text Indexing , 2020, ESA.

[72]  Simon J. Puglisi,et al.  Range Predecessor and Lempel-Ziv Parsing , 2016, SODA.

[73]  Beng Chin Ooi,et al.  Bed-tree: an all-purpose index structure for string similarity search based on edit distance , 2010, SIGMOD Conference.