Engineering a Lightweight External Memory Suffix Array Construction Algorithm

We describe an external memory suffix array construction algorithm that builds suffix arrays for blocks of text and merges them into the full suffix array. The basic idea goes back over 20 years, and there have been a couple of later improvements; we describe several further improvements that make the algorithm much faster. In particular, we reduce the I/O volume of the algorithm by a factor of $$\mathcal{O}(\log_\sigma n)$$. Our experiments show that the algorithm is the fastest suffix array construction algorithm when the size of the text is within a factor of about five of the size of the RAM in either direction, which is a common situation in practice.
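
The block-and-merge idea can be illustrated with a minimal in-memory sketch. It is not the algorithm of the paper: it ignores external memory entirely, compares suffixes naively by slicing, and does none of the I/O optimizations mentioned above. The names `block_suffix_array`, `blockwise_suffix_array`, and `block_size` are hypothetical and chosen only for this illustration.

```python
import heapq


def block_suffix_array(text, start, end):
    """Sort the suffixes that start in text[start:end].

    Each suffix is compared as a full suffix of `text`, so it may extend
    past the end of its block. Naive slicing is used for clarity only.
    """
    return sorted(range(start, end), key=lambda i: text[i:])


def blockwise_suffix_array(text, block_size):
    """Build the suffix array of `text` from per-block suffix arrays.

    Assumption: block_size >= 1. A real external memory algorithm would
    keep only one block in RAM at a time and stream the merge from disk.
    """
    blocks = [
        block_suffix_array(text, s, min(s + block_size, len(text)))
        for s in range(0, len(text), block_size)
    ]
    # k-way merge of the sorted blocks, again ordering by full suffix.
    return list(heapq.merge(*blocks, key=lambda i: text[i:]))


if __name__ == "__main__":
    print(blockwise_suffix_array("banana", 2))  # [5, 3, 1, 0, 4, 2]
```

The block size plays the role of the available RAM here: smaller blocks mean less memory per sorting step but more blocks to merge.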
