A grammar compressor for collections of reads with applications to the construction of the BWT

We describe a grammar for DNA sequencing reads from which we can compute the Burrows-Wheeler transform (BWT) directly. Our motivation is to perform, in succinct space, genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar and, when required, compute its BWT so that the analysis can be carried out with self-indexes. Our experiments on real data show that the space reduction achieved by our compressor is competitive with LZ-based methods and better than entropy-based approaches. Compared to other popular grammars, on this kind of data we achieve, on average, 12% extra compression and require less working space and time.
