Lower density selection schemes via small universal hitting sets with short remaining path length

Universal hitting sets are sets of words that are unavoidable: every sufficiently long sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes: minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size. Local schemes are a generalization of minimizers schemes that can be used as a replacement for minimizers schemes, with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a universal hitting set. We give bounds for the remaining path length of the Mykkeltveit universal hitting set. Additionally, we create a local scheme with the lowest known density, which is only a log factor away from the theoretical lower bound.
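To make the terms "hit" and "density" concrete, here is a minimal Python sketch. It is an illustration only and not the construction of this paper: it assumes a plain lexicographic order on k-mers with leftmost tie-breaking, and the names `is_hit`, `minimizer_positions`, and `density` are hypothetical helpers introduced for this example.

```python
def is_hit(seq, words):
    """Return True if seq contains at least one word from the set `words`."""
    return any(word in seq for word in words)


def minimizer_positions(seq, k, w):
    """Start positions of the k-mers selected by a minimizers scheme.

    Assumption: k-mers are compared lexicographically and ties are broken
    by taking the leftmost k-mer in the window.
    """
    n_kmers = len(seq) - k + 1
    selected = set()
    # Slide a window covering w consecutive k-mers over the sequence.
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        selected.add(best)
    return selected


def density(seq, k, w):
    """Fraction of k-mer positions in seq that the scheme selects."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w)) / n_kmers


if __name__ == "__main__":
    print(is_hit("ACGT", {"CG"}))                  # a word of the set occurs
    print(sorted(minimizer_positions("ACGTACGT", 2, 3)))
    print(density("ACGTACGT", 2, 3))
```

Every window of w consecutive k-mers contains a selected position, so a lower density means fewer selected k-mers for the same coverage guarantee; this is the sense in which low-density schemes are called efficient above.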
