PHONI: Streamed Matching Statistics with Multi-Genome References

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.
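For readers unfamiliar with the term, the matching statistics of a pattern with respect to a text can be taken to record, for each pattern position, the length of the longest substring starting there that also occurs in the text, together with one position in the text where it occurs. The following is a naive quadratic-time sketch of that definition only. It is not the compressed-index method of Bannai et al. or the streaming algorithm described here, and the function name `matching_statistics` and the `(pos, length)` output convention are assumptions made for illustration.

```python
def matching_statistics(text: str, pattern: str) -> list[tuple[int, int]]:
    """Naive matching statistics of `pattern` with respect to `text`.

    For each pattern position i, returns (pos, length) where `length` is the
    length of the longest prefix of pattern[i:] that occurs in `text`, and
    `pos` is the start of one such occurrence (-1 when length is 0).
    """
    result = []
    for i in range(len(pattern)):
        pos, length = -1, 0
        # Extend the match by one character at a time while it still occurs.
        while i + length < len(pattern):
            found = text.find(pattern[i:i + length + 1])
            if found == -1:
                break
            pos = found
            length += 1
        result.append((pos, length))
    return result


if __name__ == "__main__":
    print(matching_statistics("banana", "nab"))
```

Because each entry in this sketch depends only on the pattern suffix starting at that position, the definition itself does not force a second pass over the pattern; the cost in practice comes from answering the "does this still occur?" question efficiently over a compressed text.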

[1] Christina Boucher, et al. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment, 2018, bioRxiv.

[2] Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression, 2003.

[3] Tomasz Kociumaka, et al. Practical Performance of Space Efficient Data Structures for Longest Common Extensions, 2020, ESA.

[4] Fabio Cunial, et al. Fast matching statistics in small space, 2018, SEA.

[5] Stephan A. Frye, et al. Rapid identification of pathogens, antibiotic resistance genes and plasmids in blood cultures by nanopore sequencing, 2020, Scientific Reports.

[6] Hideo Bannai, et al. Fully Dynamic Data Structure for LCE Queries in Compressed Space, 2016, MFCS.

[7] Wojciech Rytter, et al. Application of Lempel-Ziv factorization to the approximation of grammar-based compression, 2002, Theor. Comput. Sci.

[8] B. Langmead, et al. MONI: A Pangenomics Index for Finding MEMs, 2021, bioRxiv.

[9] Marco Oliva, et al. Portable nanopore analytics: are we there yet?, 2020, Bioinform.

[10] Hiroshi Sakamoto, et al. Rpair: Rescaling RePair with Rsync, 2019, SPIRE.

[11] Anupama Sinha, et al. Real-Time Selective Sequencing with RUBRIC: Read Until with Basecall and Reference-Informed Criteria, 2018.

[12] Gonzalo Navarro, et al. Faster Compressed Suffix Trees for Repetitive Collections, 2016, ACM J. Exp. Algorithmics.

[13] M. Schatz, et al. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED, 2020, Nature Biotechnology.

[14] Gonzalo Navarro, et al. Faster Compressed Suffix Trees for Repetitive Text Collections, 2014, SEA.

[15] A. Moffat, et al. Offline dictionary-based compression, 2000, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[16] Enno Ohlebusch, et al. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction, 2013.

[17] I Tomohiro, et al. Deterministic Sparse Suffix Sorting in the Restore Model, 2020, ACM Trans. Algorithms.

[18] Hideo Bannai, et al. Refining the r-index, 2018, Theor. Comput. Sci.

[19] Lucian Ilie, et al. The longest common extension problem revisited and applications to approximate string searching, 2010, J. Discrete Algorithms.

[20] H. Sakamoto, et al. Practical Random Access to SLP-Compressed Texts, 2020, SPIRE.

[21] Abhi Shelat, et al. The smallest grammar problem, 2005, IEEE Transactions on Information Theory.

[22] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, 2013, arXiv:1303.3997.

[23] Gonzalo Navarro, et al. Compact Data Structures - A Practical Approach, 2016.

[24] Gonzalo Navarro, et al. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space, 2018, J. ACM.