Compressed Spaced Suffix Arrays

As a first step in designing relatively compressed data structures (that is, structures for which storing an instance for one dataset helps us store instances for similar datasets), we consider how to compress spaced suffix arrays relative to normal suffix arrays while still supporting fast access to them. This problem is of practical interest in similarity search with spaced seeds: using several seeds in parallel significantly improves search performance, but existing approaches keep a separate linear-space hash table or spaced suffix array for each seed. We first prove a theoretical upper bound on the space needed to store a spaced suffix array when we already have the suffix array. We then present experiments indicating that our approach works even better in practice.
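To make the two objects concrete, here is a minimal Python sketch that builds a plain suffix array and a spaced suffix array for a small text. It rests on assumptions not stated in the abstract: the seed is written as a string of `1` (compare) and `0` (don't care) characters, suffixes are ordered only by the characters under the seed's `1` positions, and ties are broken by text position. The function names are hypothetical, and the naive sorting only illustrates the definitions; it is not the construction or the compression scheme studied in the paper.

```python
def suffix_array(text):
    """Plain suffix array: start positions sorted by the full suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def spaced_suffix_array(text, seed):
    """Spaced suffix array for a seed such as "101" (assumed notation).

    Positions are ordered by the characters found under the seed's '1'
    offsets; offsets that run past the end of the text are dropped.
    Ties are broken by text position (an assumption of this sketch).
    """
    care = [k for k, bit in enumerate(seed) if bit == "1"]

    def masked(i):
        return tuple(text[i + k] for k in care if i + k < len(text))

    return sorted(range(len(text)), key=lambda i: (masked(i), i))


text = "abracadabra"
print(suffix_array(text))                # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(spaced_suffix_array(text, "101"))  # [10, 3, 5, 0, 7, 1, 8, 4, 6, 9, 2]
```

In this small example the two arrays agree in several entries and keep some pairs of positions in the same relative order. That kind of overlap is what makes it plausible to store one array relative to the other.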
