Constructing Antidictionaries in Output-Sensitive Space

A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, y_2, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}_{y_1\#\cdots\#y_k}$ of minimal absent words of length at most $\ell$ of the word $y = y_1\#y_2\#\cdots\#y_k$, where $\# \notin \Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. Using any of the many available $O(n)$-time algorithms, this computation generally requires $\Omega(n)$ space for $n = |y|$, because these algorithms construct an $\Omega(n)$-sized text index over $y$, which can be impractical for large $n$. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when $\|M^{\ell}_{y_1\#\cdots\#y_N}\| = o(n)$ for all $N \in [1, k]$. For instance, in the human genome, $n \approx 3 \times 10^9$ but $\|M^{12}_{y_1\#\cdots\#y_k}\| \approx 10^6$. We state our results for a constant-sized alphabet. We show that all of $M^{\ell}_{y_1}, \ldots, M^{\ell}_{y_1\#\cdots\#y_k}$ can be computed in $O(kn + \sum_{N=1}^{k} \|M^{\ell}_{y_1\#\cdots\#y_N}\|)$ total time using $O(\mathrm{MaxIn} + \mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $y_1, \ldots, y_k$ and $\mathrm{MaxOut} = \max\{\|M^{\ell}_{y_1\#\cdots\#y_N}\| : N \in [1, k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
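To make the definition concrete, the following is a minimal brute-force sketch in Python; it is not the algorithm of this paper. It stores every factor of length at most $\ell$ explicitly, so it uses neither output-sensitive space nor linear time. The function names `factors_up_to` and `minimal_absent_words` are hypothetical and chosen only for this illustration. The sketch assumes that words are given as Python strings over an alphabet that excludes the separator, and it relies on the fact that a word $x = a u b$ of length at least 2 is a minimal absent word exactly when $au$ and $ub$ occur but $aub$ does not.

```python
def factors_up_to(words, max_len):
    """Return all factors (substrings) of length <= max_len of any input word,
    including the empty word. Factors never span two different words, which
    models the separator # between y_1, ..., y_k."""
    facs = {""}
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(len(w), i + max_len) + 1):
                facs.add(w[i:j])
    return facs


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force set of minimal absent words of length <= max_len
    of the collection `words` over `alphabet` (illustrative only)."""
    facs = factors_up_to(words, max_len)
    maws = set()
    if max_len >= 1:
        # A single letter is a minimal absent word if it never occurs.
        maws.update(a for a in alphabet if a not in facs)
    for f in facs:
        if not f or len(f) >= max_len:
            continue
        for b in alphabet:
            x = f + b
            # f = x[:-1] occurs by construction; require x[1:] to occur
            # and x itself to be absent.
            if x not in facs and x[1:] in facs:
                maws.add(x)
    return maws


# Incremental view: the antidictionary changes as documents are added.
after_first = minimal_absent_words(["ab"], "ab", 3)
after_second = minimal_absent_words(["ab", "ba"], "ab", 3)
single_word = minimal_absent_words(["abaab"], "ab", 4)
```

For the single word `abaab` with length bound 4, the sketch yields `aaa`, `aaba`, `bab`, and `bb`. For the collection `ab`, `ba` with length bound 3, the set goes from `aa`, `ba`, `bb` after the first word to `aa`, `aba`, `bab`, `bb` after the second: `ba` leaves the set once it occurs, and longer words enter it.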
