PFP Data Structures

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to simplify computing Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP} (S)|$. In practice, $D$ and $P$ are significantly smaller than $S$, and computing $\mathrm{BWT} (S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP} (S)$ as a {\em data structure} and show how it can be augmented to support the following queries quickly, still in $O (|\mathrm{PFP} (S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: construction takes one hour and 54 GB of peak memory for $1000$ variants of human chromosome 19, which initially occupy roughly 56 GB.
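
To make the roles of $D$ and $P$ concrete, the following is a minimal sketch of a PFP-style parse in Python. It is an illustration under stated assumptions, not the authors' implementation: the window length `w`, the modulus `p`, the sentinel characters, and the use of CRC32 in place of a rolling hash are all choices made here for brevity. A window of length `w` is treated as a trigger when its hash is divisible by `p`; each phrase runs from one trigger window to the next, so consecutive phrases overlap by exactly `w` characters.

```python
import zlib

# Hypothetical sentinels, assumed not to occur in the input string.
START, END = "\x02", "\x03"


def pfp(s, w=4, p=10):
    """Return (D, P): the sorted list of distinct phrases and the parse as ranks into D."""
    text = START + s + END * w
    last = len(text) - w  # start position of the final window, END * w

    def is_trigger(i):
        # The final window is always a trigger so that the last phrase is closed.
        if i == last:
            return True
        # Assumption: CRC32 stands in for a rolling hash of the window.
        return zlib.crc32(text[i:i + w].encode()) % p == 0

    phrases = []
    start = 0
    for i in range(1, last + 1):
        if is_trigger(i):
            # The phrase includes the trigger window, which also opens the next phrase.
            phrases.append(text[start:i + w])
            start = i

    D = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(D)}
    P = [rank[phrase] for phrase in phrases]
    return D, P


def reconstruct(D, P, w=4):
    """Rebuild the original string by undoing the w-character overlaps."""
    text = D[P[0]]
    for r in P[1:]:
        text += D[r][w:]
    return text[1:len(text) - w]  # strip START and the trailing END * w


if __name__ == "__main__":
    s = "GATTACAGATTACAGATTACAGATTTCAGATTACA" * 20
    D, P = pfp(s)
    assert reconstruct(D, P) == s
    print(len(s), len(P), sum(len(phrase) for phrase in D))
```

On repetitive input, identical regions tend to be cut at the same trigger windows, so the same phrases recur and $D$ stays small relative to $S$; this is the property the abstract relies on for collections of genomes from one species.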

[1] Alistair Moffat, et al. From Theory to Practice: Plug and Play with Succinct Data Structures, 2013, SEA.

[2] Ely Porat, et al. Locally Consistent Parsing for Text Indexing in Small Space, 2018, SODA.

[3] Guilherme P. Telles, et al. Inducing enhanced suffix arrays for string collections, 2017, Theor. Comput. Sci.

[4] Travis Gagie, et al. Prefix-free parsing for building big BWTs, 2018, Algorithms for Molecular Biology.

[5] Andrew Tridgell, et al. Efficient Algorithms for Sorting and Synchronization, 1999.

[6] Gonzalo Navarro, et al. Wavelet trees for all, 2012, J. Discrete Algorithms.

[7] Gonzalo Navarro, et al. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space, 2018, J. ACM.

[8] Christina Boucher, et al. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment, 2018, bioRxiv.

[9] Jesse D. Kornblum. Identifying almost identical files using context triggered piecewise hashing, 2006, Digit. Investig.

[10] Alexandru I. Tomescu, et al. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing, 2015.

[11] Gonzalo Navarro, et al. Faster Repetition-Aware Compressed Suffix Trees based on Block Trees, 2019, SPIRE.

[12] Gonzalo Navarro, et al. Compact Data Structures - A Practical Approach, 2016.

[13] Tomasz Kociumaka, et al. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure, 2019, STOC.

[14] Ge Nong, et al. Practical linear-time O(1)-workspace suffix sorting for constant alphabets, 2013, TOIS.

[15] Peter Weiner, et al. Linear Pattern Matching Algorithms, 1973, SWAT.

[16] Hideo Bannai, et al. Refining the r-index, 2018, Theor. Comput. Sci.

[17] Paolo Ferragina, et al. A simple storage scheme for strings achieving entropy bounds, 2007, SODA '07.

[18] Travis Gagie, et al. Lightweight Data Indexing and Compression in External Memory, 2009, Algorithmica.

[19] Christina Boucher, et al. Matching Reads to Many Genomes with the r-Index, 2020, Journal of computational biology : a journal of computational molecular cell biology.

[20] Ruth Timme, et al. The Public Health Impact of a Publically Available, Environmental Database of Microbial Genomes, 2017, Front. Microbiol.